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A METHOD FOR IDENTIFYING A SUBSET OF COMPONENTS OF A SYSTEM 
FIELD OF THE INVENTION 

The present invention relates to a method and apparatus for 
identifying components of a system from data generated from 
samples from the system, which components are capable of 
predicting a feature of the sample within the system and, 
particularly, but not exclusively, the present invention 
relates to a method and apparatus for identifying components 
of a biological system from data generated by a biological 
method, which components are capable of predicting a feature 
of interest associated with a sample applied to the 
biological system. 

BACKGROUND OF THE INVENTION 

There are any number of systems in existence that can be 
classified according to one or more features thereof. The 
term "system" as used throughout this specification is 
considered to include all types of systems from which data 
(e.g. statistical data) can be obtained. Examples of such 
systems include chemical systems, financial systems and 
geological systems. It is desirable to be able to utilise 
data obtained from the systems to identify particular 
features of samples from the system; for instance, to assist 
with analysis of financial system to identify groups such as 
those who have good credit and those who are a credit risk. 
Often the data obtained from the systems is relatively large 
and therefore it is desirable to identify components of the 
systems from the data, the components being predictive of 
the particular features of the samples from the system. 
However, when the data is relatively large it can be 
difficult to identify the components because there is a 
large amount of data to process, the majority of which may 
not provide any indication or little indication of the 
features of a particular sample from which the data is 
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taken. Furthermore, components that are identified using a 
training sample are often ineffective at identifying 
features on test sample data when the test sample data has a 
high degree of variability relative to the training sample 
data. This is often the case in situations when, for 
example, data is obtained from many different sources, as it 
is often difficult to control the conditions under which the 
data is collected from each individual source. 

An example of a type of system where these problems are 
particularly pertinent, is a biological system, in which the 
components could include, for example, particular genes or 
proteins. Recent advances in biotechnology have resulted in 
the development of biological methods for large scale 
screening of systems and analysis of samples. Such methods 
include, for example, microarray analysis using DNA or RNA, 
proteomics analysis, proteomics electrophoresis gel 
analysis, and high throughput screening techniques. These 
types of methods often result in the generation of data that 
can have up to 30,000 or more components for each sample 
that is tested. 



It is highly desirable to be able identify features of 
interest in samples from biological systems. For example, 
to classify groups such as "diseased" and "non-diseased" . 
Many of these biological methods would be useful as 
diagnostic tools predicting features of a sample in the 
biological systems. For example, identifying diseases by 
screening tissues or body fluids, or as tools for 
determining, for example, the efficacy of pharmaceutical 
compounds . 

Use of biological methods such as biotechnology arrays in 
such applications to date has been limited due to the large 
amount of data that is generated from these types of 
methods, and the lack of efficient methods for screening the 
data for meaningful results. Consequently, analysis of 
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biological data using existing methods is time consuming, 
prone to false results and requires large amounts of 
computer memory if a meaningful result is to be obtained 
from the data. This is problematic in large scale screening 
5 scenarios where rapid and accurate screening is required. 

It is therefore desirable to have a method, in particular 
for analysis of biological data, and more generally, for an 
improved method of analysing data from a system in order to 
10 predict a feature of interest for a sample from the system. 

SUMMARY OF THE INVENTION 

According to a first aspect of the present invention, there 
15 is provided a method of identifying a subset of components 
of a system based on data obtained from the system using at 
least one training sample from the system, the method 
comprising the steps of: 

obtaining a linear combination of components of the 
20 system and weightings of the linear combination of 

components, the weightings having values based on the data 
obtained from the system using the at least one training 
sample, the at least one training sample having a known 
feature ; 

25 obtaining a model of a probability distribution of the 

known feature, wherein the model is conditional on the 

linear combination of components; 

obtaining a prior distribution for the weighting of 

the linear combination of the components, the prior 
3 0 distribution comprising a hyperprior having a high 

probability density close to zero, the hyperprior being such 

that it is not a Jeffreys hyperprior; 

combining the prior distribution and the model to 

generate a posterior distribution; and 
35 identifying the subset of components based on a set of 

the weightings that maximise the posterior distribution. 
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The method utilises training samples having the known 
feature in order to identify the subset of components which 
can predict a feature for a training sample. Subsequently, 
knowledge of the subset of components can be used for tests, 
5 for example clinical tests, to predict a feature such as 
whether a tissue sample is malignant or benign, or what is 
the weight of a tumour, or provide an estimated time for 
survival of a patient having a particular condition. 



10 The term "feature" as used throughout this specification 
refers to any response or identifiable trait or character 
that is associated with a sample. For example, a feature 
may be a particular time to an event for a particular 
sample, or the size or quantity of a sample, or the class or 

15 group into which a sample can be classified. 

Preferably, the step of obtaining the linear combination 
comprises the step of using a Bayesian statistical method to 
estimate the weightings. 

20 

Preferably, the method further comprises the step of making 
an apriori assumption that a majority of the components are 
unlikely to be components that will form part of the subset 
of components . 

25 

The apriori assumption has particular application when there 
are a large amount of components obtained from the system. 
The apriori assumption is essentially that the majority of 
the weightings are likely to be zero. The model is 

30 constructed such that with the apriori assumption in mind, 
the weightings are such that the posterior probability of 
the weightings given the observed data is maximised. 
Components having a weighting below a pre -determined 
threshold (which will be the majority of them in accordance 

35 with the apriori assumption) are ignored. The process is 
iterated until the correct diagnostic components are 
identified. Thus, the method has the potential to be quick. 
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mainly because of the apriori assumption, which results in 
rapid elimination of the majority of components. 

Preferably, the hyperprior comprises one or more adjustable 
5 parameter that enable the prior distribution near zero to be 
varied. 

Most features of a system typically exhibit a probability 
distribution, and the probability distribution of a feature 

10 can be modelled using statistical models that are based on 
the data generated from the training samples. The present 
invention utilises statistical models that model the 
probability distribution for a feature of interest or a 
series of features of interest. Thus, for a feature of 

15 interest having a particular probability distribution, an 

appropriate model is defined that models that distribution. 

Preferably, the method comprise a mathematical equation in 
the form of a likelihood function that provides the 
20 probability distribution based on data obtained from the at 
least one training sample. 

Preferably, the likelihood function is based on a previously 
described model for describing some probability 
25 distribution. 

Preferably, the step of obtaining the model comprises the 
step of selecting the model from a group comprising a 
multinomial or binomial logistic regression, generalised 
30 linear model. Cox's proportional hazards model, accelerated 
failure model and parametric survival model. 

In a first embodiment, the likelihood function is based on 
the multinomial or binomial logistic regression. The 
35 binomial or multinomial logistic regression preferably 
models a feature having a multinomial or binomial 
distribution. A binomial distribution is a statistical 
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distribution having two possible classes or groups such as 
an on/off state. Examples of such groups include 
dead/alive, improved/not improved, depressed/not depressed. 
A multinomial distribution is a generalisation of the 
binomial distribution in which a plurality of classes or 
groups are possible for each of a plurality of samples, or 
in other words, a sample may be classified into one of a 
plurality of classes or groups. Thus, by defining a 
likelihood function based on a multinomial or binomial 
logistic regression, it is possible to identify subsets of 
components that are capable of classifying a sample into one 
of a plurality of pre-defined groups or classes. To do 
this, training samples are grouped into a plurality of 
sample groups (or "classes") based on a predetermined 
feature of the training samples in which the members of each 
sample group have a common feature and are assigned a common 
group identifier. A likelihood function is formulated based 
on a multinomial or binomial logistic regression conditional 
on the linear combination (which incorporates the data 
generated from the grouped training samples) . The feature 
may be any desired classification by which the training 
samples are to be grouped. For example, the features for 
classifying tissue samples may be that the tissue is normal, 
malignant, benign, a leukemia cell, a healthy cell, that the 
training samples are obtained from the blood of patients 
having or not having a certain condition, or that the 
training samples are from a cell from one of several types 
of cancer as compared to a normal cell. 

In the first embodiment, the likelihood function based on 
the multinomial or binomial logistic regression is of the 
form: 
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wherein 

xfpg is a linear combination generated from input data from 
training sample i with component weights p gt 

xf is the components for the i th Row of X and p 9 is a set of 
component weights for sample class g; and 

X is data from n training samples comprising p components 
and the e 1>c are defined further in this specification. 

In a second embodiment, the likelihood function is based on 
the ordered categorical logistic regression. The ordered 
categorical logistic regression models a binomial or 
multinomial distribution in which the classes are in a 
particular order (ordered classes such as for example, 
classes of increasing or decreasing disease severity) . By 
defining a likelihood function based on an ordered 
categorical logistic regression, it is possible to identify 
a subset of components that is capable of classifying a 
sample into a class wherein the class is one of a plurality 
of predefined ordered classes. By defining a series of 
group indentifiers in which each group identifier 
corresponds to a member of an ordered class, and grouping 
the training samples into one of the ordered classes based 
on predetermined features of the training samples, a 
likelihood function can be formulated based on a categorical 
ordered logistic regression which is conditional on the 
linear combination (which incorporates the data generated 
from the grouped training samples) . 

In the second embodiment, the likelihood function based on 
the categorical ordered logistic regression is of the form: 



N c -' f Y'* ( V 

<=nn -N y^A 



Wherein 

y ik is the probability that training sample i belongs to a 
class with identifier less than or equal to k (where the 
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total of ordered classes is G) .The r ± is defined further in 
the document. 

In a third embodiment of the present invention, the 
likelihood function is based on the generalised linear 
model. The generalised linear model preferably models a 
feature that is distributed as a regular exponential family 
of distributions. Examples of regular exponential family of 
distributions include normal distribution, guassian 
distribution, poisson distribution, gamma distribution and 
inverse gaussian distribution. Thus, in another embodiment 
of the method of the invention, a subset of components is 
identified that is capable of predicting a predefined 
characteristic of a sample which has a distribution 
belonging to a regular exponential family of distributions. 
In particular by defining a generalised linear model which 
models the characteristic to be predicted. Examples of a 
characteristic that may be predicted using a generalised 
linear model include any quantity of a sample that exhibits 
the specified distribution such as, for example, the weight, 
size or other dimensions or quantities of a sample. 

In the third embodiment, the generalised linear model is of 
the form: 

L = log p(y | 0, 0) =± { M^m +cM) } 
«=i 0,(0) 

where y = (y 1# „, y n ) T and a 4 (0) = 0 / Wl with the w t being a 
fixed set of known weights and 0 a single scale parameter. 
The other terms in this expression are defined later in this 
document . 

In a fourth embodiment, the method of the present invention 
may be used to predict the time to an event for a sample by 
utilising the likelihood function that is based on a hazard 
model, which preferably estimates the probability of a time 
to an event given that the event has not taken place at the 
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time of obtaining the data. In the fourth embodiment, the 
likelihood function is selected from the group comprising a 
Cox's proportional hazards model, parametric survival model 
and accelerated failure times model. Cox's proportional 
hazards model permits the time to an event to be modelled on 
a set of components and component weights without making 
restrictive assumptions about time. The accelerated failure 
model is a general model for data consisting of survival 
times in which the component measurements are assumed to act 
multiplicatively on the time- scale, and so affect the rate 
at which an individual proceeds along the time axis. Thus, 
the accelerated survival model can be interpreted in terms 
of the speed of progression of, for example, disease. The 
parametric survival model is one in which the distribution 
function for the time to an event (eg survival time) is 
modelled by a known distribution or has a specified 
parametric formulation. Among the commonly used survival 
distributions are the Weibull, exponential and extreme value 
distributions . 



In the fourth embodiment, a subset of components capable of 
predicting the time to an event for a sample is identified 
by defining a likelihood based on Cox's proportional 
standards model, a parametric survival model or an 
accelerated survival times model, which comprises measuring 
the time elapsed for a plurality of samples from the time 
the sample is obtained to the time of the event. 

In the fourth embodiment, the likelihood function for 
predicting the time to an event is of the form: 

N 

Log ( Partial) Likelihood = £ 8i(P.9'.X ,y,c) 



where fij =(Pi.fi 2 -.P p ) and <p< ^.q^,- ,<P q Jare the model 
parameters, y is a vector of observed times and c is an 
indicator vector which indicates whether a time is a true 
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survival time or a censored survival time. 

In the fourth embodiment, the likelihood function based on 
Cox's proportional hazards model is of the form: 



r 



N 

n 



exp (z jP ) 
£ exp (Z t p_ ) 
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where the observed times are be ordered in increasing 
magnitude denoted as £ = ('( 7) ,* (2) , -t {N) ), / (f+/) >/ (f) . and Z denotes 

the Nxp matrix that is the re -arrangement of the rows of X 
where the ordering of the rows of Z corresponds to the 
ordering induced by the ordering of / . Also P T =(p\,P 2 , -,P n ) , 
Zi = the j th row of Z. and <R y - = {i:i = JJ + l,~.,N}= the risk set 
at the j* th ordered event time /(y) . 
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In the fourth embodiment, wherein the likelihood function is 
based on the Parametric Survival model it is of the form: 



N 
i=l 



where Mi =A(y i ;<p)exp(X i p) and A denotes the integrated 
parametric hazard function. 

Por any defined models, the weightings are typically 
estimated using a Bayesian statistical model (Kotz and 
Johnson, 1983) in which a posterior distribution of the 
component weights is formulated which combines the 
likelihood function and a prior distribution. The component 
weightings are estimated by maximising the posterior 
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distribution of the weightings given the data generated for 
the at least one training sample. Thus, the objective 
function to be maximised consists of the likelihood function 
based on a model for the feature as discussed above and a 
5 prior distribution for the weightings. 

Preferably, the prior distribution is of the form: 



10 wherein v is a p x 1 vector of hyperparameters, and where 



distribution for v 2 . 

Preferably, the hyperprior comprises a gamma distribution 
15 with a specified shape and scale parameter. 

This hyperprior distribution (which is preferably the same 
for all embodiments of the method) may be expressed using 
different notational conventions, and in the detailed 
20 description of the embodiments (see below) , the following 
notational conventions are adopted merely for convenience 
for the particular embodiment: 

As used herein, when the likelihood function for the 
25 probability distribution is based on a multinomial or 

binomial logistic regression, the notation for the prior 
distribution is: 






30 where ./£_,) and r' =(<,..., 
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and p{p g \*l) is Af(0,diag{rj}) and />(t 2 ) is some hyperprior 
distribution for t\ . 

As used herein, when the likelihood function for the 
probability distribution is based on a categorical ordered 
logistic regression, the notation for the prior distribution 
is: 

P{PuP 2 ,~,P n ) = \f\P{Pi\v?)P{vt)dv> 

where Pi,P 2 ,-,P„ are component weights, P(P i \v i ) is N(0,vf jand 
P[v, J some hyperprior distribution for v*. 

As used herein, when the likelihood function for the 
distribution is based on a generalised linear model, the 
notation for the prior distribution is: 

p(P)= \p(P \y 2 )p(v 2 )dv 2 

wherein v is a p x 1 vector of hyperparameters, and where 
p(P \y 2 ) is AT(o,diag{v 2 }) and p(v 2 ) is some prior distribution 
for v 2 . 

As used herein, when the likelihood function for the 
distribution is based on a hazard model, the notation for 
the prior distribution is: 



where p(p'\t ) is Af(o,diag{T 2 }) and p(t ) some hyperprior 
distribution for r . 
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The prior distribution comprises a hyperprior that ensures 
that zero weightings are used whenever possible. 

p(fi') m J p(p'V )p( T ) dr 

T 

In an alternative embodiment/ the hyperprior is an inverse 
gamma distribution in which each if =\lvf has an independent 
gamma distribution. 

In a further alternative embodiment the hyperprior is a 
gamma distribution in which each vf , r, or zf (depending on the 
context) has an independent gamma distribution. 

As discussed previously , the prior distribution and the 
likelihood function are combined to generate a posterior 
distribution. The posterior distribution is preferably of 
the form: 

p(P<Pv\y)cc L(y\p<p)p(p\v)p(v) 
or 

wherein L{y\0,<p) is the likelihood function. 

Preferably, the step of identifying the subset of components 
comprises the step of using an iterative procedure such that 
the probability density of the posterior distribution is 
maximised. 

During the iterative procedure, component weightings having 
a value less than a pre -determined threshold are eliminated, 
preferably by setting those component weights to zero. This 
results in the substantially elimination of the 
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corresponding component . 

Preferably, the iterative procedure is an EM algorithm. 

5 The EM algorithm produces a sequence of component weighting 
estimates that converge to give component the weightings 
that maximise the probability density of the posterior 
distribution. The EM algorithm consists of two steps, known 
as the E or Expectation step and the M, or Maximisation 

10 step. In the E step, the expected value of the log- 
posterior function conditional on the observed data is 
determined. In the M step, the expected log-posterior 
function is maximised to give updated component weight 
estimates that increase the posterior. The two steps are 

15 alternated until convergence of the E step and the M step is 
achieved, or in other words, until the expected value and 
the maximised value of the expected log-posterior function 
converges. 

2 0 It is envisaged that the method of the present invention may 
be applied to any system from which measurements can be 
obtained, and preferably systems from which very large 
amounts of data are generated. Examples of systems to which 
the method of the present invention may be applied include 

25 biological systems, chemical systems, agricultural systems, 
weather systems, financial systems including, for example, 
credit risk assessment systems, insurance systems, marketing 
systems or company record systems, electronic systems, 
physical systems, astrophysics systems and mechanical 

30 systems. For example, in a financial system, the samples 
may be particular stock and the components may be 
measurements made on any number of factors which may affect 
stock prices such as company profits, employee numbers, 
rainfall values in various cities, number of shareholders 

35 etc. 
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The method of the present invention is particularly suitable 
for use in analysis of biological systems. The method of 
the present invention may be used to identify subsets of 
components for classifying samples from any biological 
system which produces measurable values for the components 
and in which the components can be uniquely labelled. In 
other words, the components are labelled or organised in a 
manner which allows data from one component to be 
distinguished from data from another component. For 
example, the components may be spatially organised in, for 
example, an array which allows data from each component to 
be distinguished from another by spatial position, or each 
component may have some unique identification associated 
with it such as an identification signal or tag. For 
example, the components may be bound to individual carriers, 
each carrier having a detectable identification signature 
such as quantum dots (see for example, Rosenthal, 2001, 
Nature Biotech 19s 621-622; Han et al. (2001) Nature 
Biotechnology 19: 631-635), fluorescent markers (see for 
example, Fu et al, (1999) Nature Biotechnology 17: 1109- 
1111) , bar-coded tags (see for example, Lockhart and trulson 
(2001) Nature Biotechnology 19: 1122-1123). 

In a particularly preferred embodiment, the biological 
system is a biotechnology array. Examples of biotechnology 
arrays include oligonucleotide arrays, DNA arrays, DNA 
microarrays, RNA arrays, RNA microarrays, DNA microchips, 
RNA microchips, protein arrays, protein microchips, antibody 
arrays, chemical arrays, carbohydrate arrays, proteomics 
arrays, lipid arrays. In another embodiment, the biological 
system may be selected from the group including, for 
example, DNA or RNA electrophoresis gels, protein or 
proteomics electrophoresis gels, biomolecular interaction 
analysis such as Biacore analysis, amino acid analysis, 
ADMETox screening (see for example High- throughput ADMETox 
estimation: In Vitro and In Silico approaches (2002), Ferenc 
Darvas and Gyorgy Dorman (Eds) , Biotechniques Press) , 
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protein electrophoresis gels and proteinics electrophoresis 
gels. 



The components may be any measurable component of the 
system. In the case of a biological system, the components 
may be, for example, genes or portions thereof, DNA 
sequences, RNA sequences, peptides, proteins, carbohydrate 
molecules, lipids or mixtures thereof, physiological 
components, anatomical components, epidemiological 
components or chemical components. 

The training samples may be any data obtained from a system 
in which the feature of the sample is known. For example, 
training samples may be data generated from a sample applied 
to a biological system. For example, when the biological 
system is a DNA microarray, the training sample may be data 
obtained from the array following hybridisation of the array 
with RNA extracted from cells having a known feature, or 
cDNA synthesised from the RNA extracted from cells, or if 
the biological system is a proteinics electrophoresis gel, 
the training sample may be generated from a protein or cell 
extract applied to the system. 

It is envisaged that an embodiment of a method of the 
present invention may be used in re -evaluating or evaluating 
test data from subjects who have presented mixed results in 
response to a test treatment. Thus, there is a second 
aspect to the present invention. 



The second aspect provides a method for identifying a subset 
of components of a subject which are capable of classifying 
the subject into one of a plurality of predefined groups, 
wherein each group is defined by a response to a test 
treatment, the method comprising the steps of: 

exposing a plurality of subjects to the test treatment 
and grouping the subjects into response groups based on 
responses to the treatment; 
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measuring components of the subjects; and 
identifying a subset of components that is capable of 

classifying the subjects into response groups using a 

statistical analysis method. 

Preferably, the statistical analysis method comprises the 
method according to the first aspect of the present 
invention. 



Once a subset of components has been identified, that subset 
can be used to classify subjects into groups such as those 
that are likely to respond to the test treatment and those 
that are not. In this manner, the method of the present 
invention permits treatments to be identified which may be 
effective for a fraction of the population, and permits 
identification of that fraction of the population that will 
be responsive to the test treatment. 

According to a third aspect of the present invention, there 
is provided an apparatus for identifying a subset of 
components of a subject, the subset being capable of being 
used to classify the subject into one of a plurality of 
predefined response groups wherein each response group, is 
formed by exposing a plurality of subjects to a test 
treatment and grouping the subjects into response groups 
based on the response to the treatment, the apparatus 
comprising: 

an input for receiving measured components of the 
subjects; and 

processing means operable to identify a subset of 
components that is capable of being used to classify the 
subjects into response groups using a statistical analysis 
method. 



Preferably, the statistical analysis method comprises the 
method according to the first or second aspect. 
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According to a fourth aspect of the present invention, there 
is provided a method for identifying a subset of components 
of a subject that is capable of classifying the subject as 
being responsive or non- responsive to treatment with a test 
compound, the method comprising the steps of: 

exposing a plurality of subjects to the test compound 
and grouping the subjects into response groups based on each 
subjects response to the test compound; 

measuring components of the subjects; and 
identifying a subset of components that is capable of 
being used to classify the subjects into response groups 
using a statistical analysis method. 

Preferably, the statistical analysis method comprises the 
method according to the first aspect. 

According to a fifth aspect of the present invention, there 
is provided an apparatus for identifying a subset of 
components of a subject, the subset being capable of being 
used to classify the subject into one of a plurality of 
predefined response groups wherein each response group is 
formed by exposing a plurality of subjects to a compound and 
grouping the subjects into response groups based on the 
response to the compound, the apparatus comprising: 

an input operable to receive measured components of 
the subjects; 

processing means operable to identify a subset of 
components that is capable of classifying the subjects into 
response groups using a statistical analysis method. 

Preferably, the statistical analysis method comprises the 
method according to the first or second aspect of the 
present invention. 

The components that are measured in the second to fifth 
aspects of the invention may be, for example, genes or small 
nucleotide polymorphisms (SNPs) , proteins, antibodies, 
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carbohydrates , lipids or any other measurable conponent of 
the subject. 

In a particularly embodiment of the fifth aspect, the 
compound is a pharmaceutical compound or a composition 
comprising a pharmaceutical compound and a pharmaceutical ly 
acceptable carrier . 

The identification method of the present invention may be 
implemented by appropriate computer software and hardware. 

According to a sixth aspect of the present invention, there 
is provided an apparatus for identifying a subset of 
components of a system from data generated from the system 
from a plurality of samples from the system, the subset 
being capable of being used to predict a feature of a test 
sample, the apparatus comprising: 

a processing means operable to: 

obtain a linear combination of components of the system 
and obtain weightings of the linear combination of 
components, each of the weightings having a value based on 
data obtained from at least one training sample, the at 
least one training sample having a known feature; 

obtaining a model of a probability distribution of a 
second feature, wherein the model is conditional on the 
linear combination of components; 

obtaining a prior distribution for the weightings of 
the linear combination of the components, the prior 
distribution comprising an adjustable hyperprior which 
allows the prior probability mass close to zero to be varied 
wherein the hyperprior is not a Jeffrey's hyperprior; 

combining the prior distribution and the model to 
generate a posterior distribution; and 

identifying the subset of components having component 
weights that maximize the posterior distribution. 
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Preferably, the processing means comprises a computer 
arranged to execute software. 

According to a seventh aspect of the present invention, 
5 there is provided a computer program which, when executed by 
a computing apparatus, allows the computing apparatus to 
carry out the method according to the first aspect of the 
present invention. 

10 The computer program may implement any of the preferred 

algorithms and method steps of the first or second aspect of 
the present invention which are discussed above. 

According to an eighth aspect of the present invention, 
15 there is provided a computer readable medium comprising the 
computer program according with the seventh aspect of the 
present invention. 

According to a ninth aspect of the present invention, there 
20 is provided a method of testing a sample from a system to 

identify a feature of the sample, the method comprising the 
steps of testing for a subset of components that are 
diagnostic of the feature, the subset of components having 
been determined by using the method according to the first 
25 or second aspect of the present invention. 

Preferably, the system is a biological system. 

According to a tenth aspect of the present invention, there 
3 0 is provided an apparatus for testing a sample from a system 
to determine a feature of the sample, the apparatus 
comprising means for testing for components identified in 
accordance with the method of the first or second aspect of 
the present invention. 

35 

According to an eleventh aspect of the present invention, 
there is provided a computer program which, when executed by 
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on a computing device, allows the computing device to carry 
out a method of identifying components from a system that 
are capable of being used to predict a feature of a test 
sample from the system, and wherein a linear combination of 
5 components and component weights is generated from data 
generated from a plurality of training samples, each 
training sample having a known feature, and a posterior 
distribution is generated by combining a prior distribution 
for the component weights comprising an adjustable 
10 hyperprior which allows the probability mass close to zero 
to be varied wherein the hyperprior is not a Jeffrey's 
hyperprior, and a model that is conditional on the linear 
combination, to estimate component weights which maximise 
the posterior distribution. 

15 

Where aspects of the present invention are implemented by 
way of a computing device, it will be appreciated that any 
appropriate computer hardware e.g. a PC or a mainframe or a 
networked computing infrastructure, may be used. 

20 

According to a twelfth aspect of the present invention, 
there is provided a method of identifying a subset of 
components of a biological system, the subset being capable 
of predicting a feature of a test sample from the biological 

25 system, the method comprising the steps of: 

obtaining a linear combination of components of the 
system and weightings of the linear combination of 
components, each of the weightings having a value based on 
data obtained from at least one training sample, the at 

30 least one training sample having a known first feature; 

obtaining a model of a probability distribution of a 
second feature, wherein the model is conditional on the 
linear combination of components; 

obtaining a prior distribution for the weightings of 

35 the linear combination of the components, the prior 

distribution comprising an adjustable hyperprior which 
allows the probability mass close to zero to be varied; 
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combining the prior distribution and the model to 
generate a posterior distribution; and 

identifying the subset of components based on the 
weightings that maximize the posterior distribution • 

BRIEF DESCRIPTION OF THE DRAWINGS 

Notwithstanding any other embodiments that may fall within 
the scope of the present invention/ an embodiment of the 
present invention will now be described, by way of example 
only, with reference to the accompanying figures, in which: 

figure 1 provides a flow chart of a method according 
to the embodiment of the present invention; 

figure 2 provides a flow chart of another method 
according to the embodiment of the present invention; 

figure 3 provides a block diagram of an apparatus 
according to the embodiment of the present invention; 

figure 4 provides a flow chart of a further method 
according to the embodiment of the present invention; 

figure 5 provides a flow chart of an additional method 
according to the embodiment of the present invention; and 

figure 6 provides a flow chart of yet another method 
according to the embodiment of the present invention. 

DETAILED DESCRIPTION OF AN EMBODIMENT 

The embodiment of the present invention identifies a 
relatively small number of components which can be used to 
identify whether a particular training sample has a feature. 
The components are "diagnostic" of that feature, or enable 
discrimination between samples having a different feature. 
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The number of components selected by the method can be 
controlled by the choice of parameters in the hyperprior. It 
is noted that the hyperprior is a gamma distribution with a 
specified shape and scale parameter. Essentially, from all 
the data which is generated from the system, the method of 
the present invention enables identification of a relatively 
small number of components which can be used to test for a 
particular feature. Once those components have been 
identified by this method, the components can be used in 
future to assess new samples. The method of the present 
invention utilises a statistical method to eliminate 
components that are not required to correctly predict the 
feature. 



The inventors have found that component weightings of a 
linear combination of components of data generated from the 
training samples can be estimated in such a way as to 
eliminate the components that are not required to correctly 
predict the feature of the training sample. The result is 
that a subset of components are identified which can 
correctly predict the feature of the training sample. The 
method of the present invention thus permits identification 
from a large amount of data a relatively small and 
controllable number of components which are capable of 
correctly predicting a feature. 

The method of the present invention also has the advantage 
that it requires usage of less computer memory than prior 
art methods. Accordingly, the method of the present 
invention can be performed rapidly on computers such as, for 
example, laptop machines. By using less memory, the method 
of the present invention also allows the method to be 
performed more quickly than other methods which use joint 
(rather than marginal) information on components for 
analysis of, for example, biological data. 
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The method of the present invention also has the advantage 
that it uses joint rather than marginal information on 
components for analysis . 

A first embodiment relating to a multiclass logistic 
regression model will now be described. 

A. Multi Class Logistic regression model 

The method of this embodiment utilises the training samples 
in order to identify a subset of components which can 
classify the training samples into pre-defined groups. 
Subsequently, knowledge of the subset of components can be 
used for tests, for example clinical tests, to classify 
samples into groups such as disease classes. For example, a 
subset of components of a DNA microarray may be used to 
group clinical samples into clinically relevant classes such 
as, for example, healthy or diseased. 

In this way, the present invention identifies preferably a 
small and controllable number of components which can be 
used to identify whether a particular training sample 
belongs to a particular group. The selected components are 
"diagnostic" of that group, or enable discrimination between 
groups. Essentially, from all the data which is generated 
from the system, the method of the present invention enables 
identification of a small number of components which can be 
used to test for a particular group. Once those components 
have been identified by this method, the components can be 
used in future to classify new samples into the groups. The 
method of the present invention preferably utilises a 
statistical method to eliminate components that are not 
required to correctly identify the group the sample belongs 
to. 



The samples are grouped into sample groups (or "classes") 
based on a pre -determined classification. The classification 
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may be any desired classification by which the training 
samples are to be grouped. For example, the classification 
may be whether the training samples are from a leukemia cell 
or a healthy cell, or that the training samples are obtained 
5 from the blood of patients having or not having a certain 

condition, or that the training samples are from a cell from 
one of several types of cancer as compared to a normal cell. 

In one embodiment, the input data is organised into an 
10 /ixpdata matrix X = with n training samples and p 

components . Typically, p will be much greater than n . 

In another embodiment, data matrix X may be replaced by an n 
x n kernel matrix K to obtain smooth functions of X as 
15 predictors instead of linear predictors. An example of the 
kernel matrix K is k ljS =exp < -0 . 5* (Xi-Xj) t (x 1 -x j ) /a 2 ) where the 
subscript on x refers to a row number in the matrix X. 
Ideally, subsets of the columns of K are selected which give 
sparse representations of these smooth functions. 

20 

Associated with each sample class (group) may be a class 
label y i9 where y f = k,k e {l,...,G}, which indicates which of G 
sample classes a training sample belongs to. We write the 
nxl vector with elements y i as y. . Given the vector y we 
25 can define indicator variables 

p fi. y, = g 

" lO, otherwise (A1) 

In one embodiment, the component weights are estimated using 
a Bayesian statistical model (see Kotz and Johnson, 1983) . 

30 Preferably, the weights are estimated by maximising the 
posterior distribution of the weights given the data 
generated from each training sample. This results in an 
objective function to be maximised consisting of two parts. 
The first part a likelihood function and the second a prior 

35 distribution for the weights which ensures that zero weights 
are preferred whenever possible. In a preferred embodiment. 
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the likelihood function is derived from a xnulticlass 
logistic model. Preferably, the likelihood function is 
computed from the probabilities: 



and 




(A3) 



wherein 



P ig is the probability that the training sample with input 
data Xi will be in sample class gr; 

xj p g is a linear combination generated from input data from 
training sample i with component weights 

xj is the components for the 1 th Row of X and P 9 is a set of 
component weights for sample class g; 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the apriori 
assumption that most of the component weights are zero. 

In one embodiment, components weights pg in equation (A2) 
are estimated in a manner whereby most of the values are 
zero, yet the samples can still be accurately classified. 

In one embodiment, the component weights are estimated by 
maximising the posterior distribution of the weights given 
the data in the Bayesian model referred to above. 

Preferably, the component weights are estimated by 
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(a) specifying a hierarchical prior for the component 
weights fi l9 ... 9 fi G ^$ and 

(b) specifying a likelihood function for the input data; 

(c) determining the posterior distribution of the weights 
given the data using (A5) ; and 

(d) determining component weights which maximise the 
posterior distribution. 

In one embodiment , the hierarchical prior specified for the 
parameters J3 l9 ... 9 j3 c ^ is of the form: 

P(A- ..Am)- jH^WH^)^ (A4) 

where 0 1 =(#, , r r =(r[ *£_,), />(^|^j is ^(O.diagj^jJ and 

^(Tg) is a suitable prior. ' ' 

In one embodiment, p(^ g ) = t{p(^ g ) where pfe) is a prior 

1=1 

wherein t^ 2 =l/r? has an independent gamma distribution. 

In another embodiment, p\rf g ) is a prior wherein r? has an 
independent gamma distribution. 

In one embodiment, the likelihood function is L(y\0 x ,.. .,J3 G _ { J of 
the form in equation (8) and the posterior distribution of 
J3 and r given y is 

/>(/?T 2 a L(y\p)p(py) p ( T >) (A5) 

In one embodiment, the likelihood function has a first and 
second derivative. 

In one embodiment, the first derivative is determined from 
the following algorithm: 
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aiogz 



(A6) 



wherein e l g =(e (g9 i = 1,...,*) , p' g =(p (g J = l,».,w) are vectors indicating 
5 membership of sample class gr and probability of class g 
respectively. 



10 



In one embodiment, the second derivative is determined from 
the following algorithm: 



d 2 logZ r X l 
^-^- = -X diag{S hgPg -p hPg }x 



(A7) 



where d hg is 1 if h equals g and zero otherwise. 

15 Equation A6 and equation A7 may be derived as follows: 

(a) Using equations (Al), (A2) and (A3), the likelihood 
function of the data can be written as: 



20 



G-\ 

n 



G-\ 



i+5> 



4fa 



(A8) 



(b) Taking logs of equation (A6) and using the fact that 

G 

2^=1 for all i gives: 



25 



(7-1 



'=1 ^g=l I B=l 



V. 8=1 



(A9) 



(c) Differentiating equation (A8) with respect to fig gives 
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aiogz. 



= xT [e g -Pg), g = \,...,G-\ (A10) 



whereby e g =(e igt i = \,n) , p' g = (p ig ,i = \,n) are vectors indicating 
membership of sample class g and probability of class g 
respectively. 

(d) The second derivative of equation (9) has elements 

j0^- = -X T diag{d hgPg -p h p g }x (All) 

where 

h = g 



Mi 



otherwise 



Component weights which maximise the posterior distribution 
of the likelihood function may be specified using an EM 
algorithm comprising an E step and an M step. 

In conducting the EM algorithm, the E step preferably 
comprises the step computing a term of the form: 

c- „ (Alia) 



where d ig = E{l/rf g | fi^y™ , d g = (d lg J 2g ,...,d pg f and d ig = V d ig = 0 if fr g = 0 

Preferably, equation (11a) is computed by calculating the 
conditional expected value of t ig 2 =l/r ig 2 when p[P ig \Tt g ) is 
N{0,tf g ) and p[*f g ) has a specified prior distribution. 
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Explicit formulae for the conditional expectation will be 
presented later. 



5 Typically, the EM algorithm comprises the steps: 

(a) performing an E step by calculating the 

conditional expected value of the posterior 
distribution of component weights using the 
10 function: 

Q=Q(r\y,r) = iogL--^yTdia g {d g (r g )} r g (Ai2) 

*=1 



where x?0 g =xfP g y g in equation (8) , d{y g )=P*d g . and 
15 d g is defined as in equation (11a) evaluated at 

fig =P gYg 9 Here p ff is a matrix of zeroes and ones 
derived from the identity matrix such that P 9 T fi ff 
selects non-zero elements of f} g which are denoted 

20 

(b) performing an M step by applying an iterative 
procedure to maximise Q as a function of y 
whereby : 



25 



W) lay J 



(A13) 



where a' is a step length such that 0£a r <l; and y = (y g , 
g~l, , G-l) . 

30 Equation (A12) may be derived as follows: 

Calculate the conditional expected value of (A5) given the 
observed data y and a set of parameter estimates P . 
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Q = Q(fl\yj) = E{logp(^,r \y)\yj) 



Consider the case when components of /? (and /? ) are set to 

zero i,e for g = 1,..., G-l, fl g = P g y g and P g =P g f g . 

Ignoring terms not involving y and using (A4) , (A5) , (A9) we 
get: 



Q-^L-\%±E\%,f\ 

e=iogi -l!i>; E {iH 



(A14) 



where x ?Pg= x f p g f g (A8) . d g (y g )= P*d g where J is defined 

as in equation (Alia) evaluated at j3 g =P g y g . 

Note that the conditional expectation can be evaluated from 
first principles given (A4) • Some explicit expressions are 
given later. 

The iterative procedure may be derived as follows: 

To obtain the derivatives required in (11)/ first note that 

from (A8), (A9) and (A10) writing d(y)={d g (r g ) 9 g = 1,...,G-1} , we 

get 




■tea*-* 



) w 



diag{d(y)}~ y 



_-*G-l ( e C-l — Po-l )_ 



-diag{d(f)} V 



(A15) 



WO 2004/104856 



PCT/AU2004/000696 



- 32 - 



and 



d 



l Q_(d/3) d 2 \ogL (d0) T „,_ 2 



where 



* gh =diag{5 eh p g -p gPh }, 

S =j 1 ' * =/l 

(0, otherwise 



+diag{d(y)Y 



(A16) 



and 



X T g =P g T X T ,g = \,...G-\ 



(A17) 



In a preferred embodiment, the iterative procedure may be 
simplified by using only the block diagonals of equation 
(A16) in equation (A13) . For g = l,...G-l, this gives: 

r? =r' g ^'{x T 8 ^x g +^K(y,)}- 2 }" , {^(e,- jP ,)-^{^(^ ) } r ;} - 

(A18) 



Rearranging equation (A18) leads to 



7? = y' g +/)"' {y/ (e g - Pg )-diag{d g if g )Y r' g ] 

(A19) 
where 
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Writing p(g) for the number of columns of Y g , (A19) requires 
the inversion of a p(g) xp(g) matrix which may be quite 
large. This can be reduced to an nxn matrix for p{g)>n by 
noting that: 



(A20) 



where Z g =ts\J g . Preferably, (A19) is used when p(g)>n and 
(A19) with (A20) substituted into equation (A19) is used 
when p(g)£n. 

Note that when rj has a Jeffreys prior we have: 



In one embodiment, tj -1/rJ has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of if g is: 

7(t* ,b,k)=b-' (t? /b) k, exp(-t,»/r(k) 

Omitting subscripts to simplify the notation, it can be 
shown that 

E{t 2 |0} = (2k+l)/(2/b + 0 2 ) (A21) 
as follows: 



Define 

oo 

I(p,b,k)= J(t 2 ) p t exp(-O.50 2 t 2 )«Kt 2 ,b,k)dt 2 

0 

then 
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Proof 

Let s = 0 2 /2 then 

I(p,b,k)=b pM>s J(t^)^ 05 exp(-st 2 h(t 2 ,b,k)dt 2 

0 

Now using the substitution u = t 2 /b we get 

oo 

IfobJcH) 1 * 05 JOO"** 5 exp(-subft(u,l,k)du 

0 

Now let s'=bs and substitute the expression for 7(u,l,k) . This 
gives 

QO 

Up&MyW™ J expKsM-l)u)u^ 5 Mu / T(k) 

0 

Looking up a table of Laplace transforms , eg Abramowitz and 
Stegun, then gives the result. 

The conditional expectation follows from 

E{t 2 |^}=I(l,b,k)/I(0,b,k) 
= (2k+l)/(2/b + /3 2 ) 

As k tends to zero and b tends to infinity we get the 
equivalent result using Jeffreys prior. For example, for 
k=0.005 and b=2*10 5 

E{t 2 |0}=(1.O1)/(1O- 5 + /3 2 ) 

Hence we can get arbitrarily close to the Jeffreys prior 
with this proper prior. 

The algorithm for this model has 
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d=E{t 2 | p}* 5 
=E {±| fa** 

where the expectation is calculated as above. 

In another embodiment/ r? has an independent ^axnma 
5 distribution with scale parameter b>0 and shape parameter 
k>0. It can be shown that 

Ju k - 3/2, exp(-(7 ii! /u+u))du 
b Ju'-'^'expH^/u +u) )du 

0 

= [2 1 K^(2^) 
Vb|ft g |K I/2 . k (27^) <A22) 

1 (2^~)K 3/2 ^(27^7) 
I & 8 I 2 K 1/2 . k (27^) 

10 where 7 ig =jS ig 2 /2b and K denotes a modified Bessel function. 
For k=l in equation (A22) 

E{r- 2 |/3}=V2/b(l/|^|) 

15 For K=0.5 in equation (A22) 

E{^|^ ig }=V2/b(l/|^|){K 1 (2 > /7 i8 )/K 0 (2 > /7 ig )} 
or equivalently 

20 

E{r^ |ft> - (l/|0 ig f ){2V7 i K I (27 7ig )/K 0 (2 > /7 ig )} 
Proof of (A.l) 

From the definition of the conditional expectation, writing 
25 y=0 2 /2b , we get 
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CD 

J r-Vexp^- 2 )^ 1 (r- 2 /b) k, exp(r- 2 /b)dr 2 
E{r- 2 |)S}=^_ 

JV'expC-^- 2 )b"' (r- 2 /b) kl exp(r- 2 /b)dr 2 

0 

Rearranging, simplifying and making the substitution u=r 2 /b 
gives the first equation in (A22) . 

The integrals in (22) can be evaluated by using the result 

where K denotes a modified Bessel function, see 
Watson (1966) . 

Examples of members of this class are k=l in which case 

E{ V 2 l^ ie > = V27b(l/|i3 i8 1) 

which corresponds to the prior used in the Lasso technique, 
Tibshirani(1996) . See also Figueiredo (2001) . 

The case k=0.5 gives 

B{iJ |/? jg }= ^(l/|/3 ig |){K,(2 > /^)/K 0 (27^)} 

or equivalently 

IA 8 }= (l/|ft g | 2 ){2V^K 1 (27^)/K 0 (2^)} 

where K 0 and K x are modified Bessel functions , see Abraznowitz 
and Stegun(1970) . Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
Stegun(1970, p379) . The expressions above demonstrate the 
connection with the Lasso model and the Jeffreys prior 
model. 
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It will be appreciated by those skilled in the art that as k 
tends to zero and b tends to infinity the prior tends to a 
Jeffreys improper prior. 

In one embodiment, the priors with 0< k £1 and b>0 form a 
class of priors which might be interpreted as penalising non 
zero coefficients in a manner which is between the Lasso 
prior and the specification using Jeffreys hyper prior. 

The hyperparameters b and k can be varied to control the 
number of components selected by the method. As k tends to 
zero for fixed b the number of components selected can be 
decreased and conversely as k tends to 1 the number of 
selected components can be increased. 

in a preferred embodiment, the EM algorithm is performed as 
follows : 



Set n=0 ,P g =I and choose an initial value for y°. Choose a 
value for b and k in equation (A22) . For example b=le7 and 
k=0 gives the Jeffreys prior model to a good degree of 
approximation. This is done by ridge regression of 
log(Pi g /p±o ) on Xi where p ig is chosen to be near one for 
observations in group g and a small quantity >0 otherwise 
- subject to the constraint of all probabilities summing 
to one. 



2. Do the E step i.e evaluate Q = Q{y\y,y"). Note that this 
also depends on the values of k and b. 

3. Set t=0. For g = \ y ... y G-\ calculate: 

a) S g = r?-Y' 8 using (A19) with (A20) substituted into 
(A19) when p(g)2n. 

(b) Writing S' =(S^g = l,...,G-l) Do a line search to find the 
value of a' in =r' +a'S' which maximises (or simply 
increases) (12) as a function of a' . 
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c) set =y' and t=t+l 
Repeat steps (a) and (b) until convergence. 

This produces y ,n+, say which maximises the current Q function 
as a function of y . 

For £ = l,.„G-l determine S g =|y iJ/J 1 £eraax\rV J 

Where e<zl , say 10" 5 . Define P g so that p ig =0 for ie^and 

This step eliminates variables with small coefficients from 
the model. 



4. Set n=n+l and go to 2 until 



convergence . 



A second embodiment relating to an ordered cat logistic 
regression will now be described. 

B. Ordered categories model 

The method of this embodiment may utilise the training 
samples in order to identify a subset of components which 
can be used to determine whether a test sample belongs to a 
particular class. For example, to identify genes for 
assessing a tissue biopsy sample using microarray analysis, 
microarray data from a series of samples from tissue that 
has been previously ordered into classes of increasing or 
decreasing disease severity such as normal tissue, benign 
tissue, localised tumour and metastasised tumour tissue are 
used as training samples to identify a subset of components 
which is capable of indicating the severity of disease 
associated with the training samples. The subset of 
components can then be subsequently used to determine 
whether previously unclassified test samples can be 
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classified as normal, benign, localised tumour or 
metastasised tumour. Thus, the subset of components is 
diagnostic of whether a test sample belongs to a particular 
class within an ordered set of classes. It will be apparent 
that once the subset of components have been identified 
only the subset of components need be tested in future ' 
diagnostic procedures to determine to what ordered class a 
sample belongs* 

The method of the invention is particularly suited for the 
analysis of very large amounts of data. Typically, large 
data sets obtained from test samples is highly variable and 
often differs significantly from that obtained from the 
training samples. The method of the present invention is 
able to identify subsets of components from a very large 
amount of data generated from training samples, and the 
subset of components identified by the method can then be 
used to classifying test samples even when the data 
generated from the test sample is highly variable compared 
to the data generated from training samples belonging to the 
same class. Thus, the method of the invention is able to 
identify a subset of components that are more likely to 
classify a sample correctly even when the data is of poor 
quality and/or there is high variability between samples of 
the same ordered class. 

The components are "predictive" for that particular ordered 
class. Essentially, from all the data which is generated 
from the system, the method of the present invention enables 
xdentification of a relatively small number of components 
which can be used to classify the training data. Once those 
components have been identified by this method, the 
components can be used in future to classify test samples. 
The method of the present invention preferably utilises a 
statistical method to eliminate components that are not 
required to correctly classify the sample into a class that 
is a member of an ordered class. 
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In the following there are N samples, and vectors such as y, 
x and u have components y 1# Zi and u A for i = 1, N . vector 
multiplication and division is defined componentwise and 
diag{ • } denotes a diagonal matrix whose diagonals are 
equal to the argument. We also use | | ■ | | to denote 
Euclidean norm. 



Preferably, there are N observations y t where y t takes 

integer values 1,...,Q. The values denote classes which are 

ordered in some way such as for example severity of disease. 

Associated with each observation there is a set of 

covariates (variables, e.g gene expression values) arranged 

into a matrix X* with n row and p columns wherein n is the 

samples and p the components. The notation x* T denotes the 

1 th row of X*. Individual (sample) i has probabilities of 

belonging to class k given by n ik =7c k (x i ) . 

Define cumulative probabilities 
* 

y,k =S 7r rt ' k = 1,.„,G 

Note that y ik is just the probability that observation i 
belongs to a class with index less than or equal to k. Let C 
be a N by p matrix with elements c ff given by 

~ _ / 1. if observation i in class j 
*-iJ-\0, otherwise 

and let R be an n by P matrix with elements r & given by 




These are the cumulative sums of the columns of C within 
rows. 



For independent observations (samples) the likelihood of the 
data may be written as: 
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=nn -H 



and the log likelihood L may be written as: 



N G ~ x ( y \ C 

' = 1( >g — +(r tt+I -r tt )logl 

«=i y=i ^Xtt+iy 



The continuation ratio model may be adopted here as follows: 



log it 



T *« Yik Ulogit 



for k = 2,...,G , see McCullagh and Nelder(1989) and 
McCullagh(1980) and the discussion therein. Note that 



logit^i^L_Jkj = -logi t ^-p-j . 



The likelihood is equivalent to a logistic regression 
likelihood with response vector >>and covariate matrix X 

y=vec{R} 

*, -[/« VcW] 

where/,,., is the G-l by G-l identity matrix and l c _, is a G-l 
by 1 vector of ones. 

Here vec{ } takes the matrix and forms a vector row by row. 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the a priori 
assumption that most of the component weights are zero. 



Following Figueiredo(2001), in order to eliminate redundant 
variables (covariates) , a prior is specified for the 
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parameters /T by introducing a p x 1 vector of 
hyperparameters . 

Preferably, the prior specified for the component weights is 
of the form 

r 2 (Bl) 

where p(p'\v 2 ) is tf(o,diag{v 2 }) and p(v 2 ) is a suitably chosen 

hyperprior. For example, p\y 2 )oeY\p(yf) is a suitable form of 
Jeffreys prior. 

In another embodiment, p(vf) is a prior wherein t. 2 =l/v/ has 
an independent gamma distribution. 

In another embodiment, p{vf) is a prior wherein vf has an 
independent gamma distribution. 

The elements of theta have a non informative prior. 

Writing L(y\p* d) for the likelihood function, in a Bayesian 

framework the posterior distribution of p , 9 and v given 
y is 

p(p*<pv\y) a L(y\p*q)p(py) p {y) {2 ) 

By treating v as a vector of missing data, an iterative 
algorithm such as an EM algorithm (Dempster et al, 1977) can 
be used to maximise (2) to produce maximum a posteriori 
estimates of p and 9. The prior above is such that the 
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maximum a posteriori estimates will tend to be sparse i.e. 
if a large number of parameters are redundant, many 
components of p* will be zero. 

Preferably fi T =(0 T ,j3' T ) in the following: 

For the ordered categories model above it can be shown that 

(ID 

E (0>=-X , diag{ / ,(l- / z)}X (12) 

Where Mt = exp(x^)/(l + exp(x^)) and 0 T = (0 2 ,...,O c ,p' T ) . 

The iterative procedure for maximising the posterior 
distribution of the components and component weights is an 
EM algorithm, such as, for example, that described in 
Dempster et al, 1977. Preferably, the EM algorithm is 
performed as follows: 

1. Chose a hyperprior and values b and k for its parameters. 
Set n=0, S 0 = {1,2,..., p } , $™ . and e =10" s (say). Set the 
regularisation parameter *: at a value much greater than 1, 
say 100. This corresponds to adding 1/tc 2 to the first G-l 
diagonal elements of the second derivative matrix in the M 
step below. 

If p £ N compute initial values 0* by 

0*=(X«X+XI)-'X T g(y+a (B2) 
and if p > N compute initial values p* by 

0*=}a -x^xx^xiy'xox^fr+fl {B3) 

where the ridge parameter A, satisfies 0 < X <, 1 and £ is 
small and chosen so that the link function g(z) = log(z/(l-z)) is 
well defined at z = y+£ . 
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2. Define 

5 0(n)_ t Pi » i e S n 

0, otherwise 

and let P n be a matrix of zeroes and ones such that the 
nonzero elements y (n> of p <n) satisfy 

Q y«) =Pn T /S (n) t j8 (-)=p n7 <") 

y=P n T 0 , 0=P„7 



15 



25 



Define w p = (w fii ,i = l,p) , such that 

r 1, i ^ G 
0, otherwise 



and let w r = P„w n 

Y 1 p 



3. Perform the E step by calculating 



(15) 



Q(P I P (n) , <P (n) ) = E{ logp(P, q> , v | y) | y, 0<">} 

= Uy\ P,^)-0.5 (||G3*w,)/d (n >|| 2 ) 

where L is the log likelihood function of y and 

20 d !g =E{l/r? g | p ig y° s , d g = {d lg ,d 2g d K ) T and for convenience we 

define d lg = 1/^ = 0 if )0 fe = O. . Using 0=P n yand ^ = P„Y n) (15) 
can be written as 



Q(7lV n \4>< n >)= L(y|P 0 y, 0<">)-O.5 (IKy^)/^)!! 2 ) (B4) 
with d(y M )=P?d (n) evaluated at p Cn) =P n y™. 

4. Do the M step. This can be done with Newton Raphson 
iterations as follows. Set y 0 = y (n) and for r=0,l,2,„. y r+1 
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= Yr + ctr 5 r where Or is chosen by a line search algorithm to 
ensure Q( 7rH | V°>, </><*>) >Q( Yr | y*>, 0<">) . 
For p £ N use 

8,= A(d\ y ™))[Y;V;X+lY\YZz r ^^) (B5) 

where 

rfVn > )= { <*<r w ). i * G 

fc y otherwise 

Y n T =A(d*(y ( °>))P n T X T 
V/'^iag^Cl-^)} 

Zr = (y-Mr) 

and /i r = exp( XP n 7 r )/(l+exp( XP n 7 r )) . 
For p > N use 

S r =A(dV n >))P .Y/C^Yj+V.rYJCYX.-^) (B6) 
with V r and z r defined as before. 

Let y* be the value of y r when some convergence criterion is 
satisfied e.g 

I I Yr - Yr*i| | < e (for example 10" 5 ) . 

5. Define p* = P n y* , S^-tffcG: | ft | >max(|A \U X ) } where e x is a 
small constant, say le-5. Set n=n+l . 

6. Check convergence. If | | y* - y (n) | | < e 2 where £ 2 is 
suitably small then stop, else go to step 2 above. 

Recovering the probabilities 

Once we have obtained estimates of the parameters P are 
obtained, calculate 
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a : ,= 



for 1=1,..., N and k » 2,...,G. 



Preferably, to obtain the probabilities we use the recursion 

K iG =a iG 

and the fact that the probabilities sum to one, for i = 
1,...,N. 



In one embodiment the covariate matrix X 
10 with rows x 4 T can be replaced by a matrix K with ij th element 
kij and k ±j = k( x 4 - Xj ) for some kernel function k. This 
matrix can also be augmented with a vector of ones. Some 
example kernels are given in Table 1 below, see Evgeniou et 
al(1999). 

15 



Kernel function 


Formula for k( x - y ) 


Gaussian radial basis function 


exp( - | x - y | | 2 / a) , a>0 


Inverse multiquadric 


(1 x - y | 2 + c 2 )- 1/2 


ntultiquadric 


(1 x - y | 2 + c 2 ) * 


Thin plate splines 


II x - y H 2 - 1 

I I x - y | | 2n ln(| 1 x - y | |) 


Multi layer perceptron 


tanh( xy-G ) , for suitable 9 


Ploynomial of degree d 


(1 + xy ) d 


B splines 


B 2n+ i(x - y) 


Trigonometric polynomials 


sin(( d +1/2 ) (x-y))/sin((x- 

y)/2) 



Tsd^le 1: Examples of kernel functions 
In Table 1 the last two kernels are preferably one 
dimensional i.e. for the case when X has only one column. 
2 0 Multivariate versions can be derived from products of these 
kernel f inactions. The definition of B 2n *i can be found in De 
Boor(1978). Use of a kernel function results in mean values 
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which are smooth (as opposed to transforms of linear) 
functions of the covariates X. Such models may give a 
substantially better fit to the data. 

A third embodiment relating to generalised linear models 
will now be described. 

C. Generalised Linear Models 

The method of this embodiment utilises the training samples 
in order to identify a subset of components which can 
predict the characteristic of a sample. Subsequently, 
knowledge of the subset of components can be used for tests, 
for example clinical tests to predict unknown values of the 
characteristic of interest. For example, a subset of 
components of a DNA microarray may be used to predict a 
clinically relevant characteristic such as, for example, a 
blood glucose level, a white blood cell count, the size of a 
tumour, tumour growth rate or survival time. 

In this way, the present invention identifies preferably a 
relatively small number of components which can be used to 
predict a characteristic for a particular sample. The 
selected components are "predictive" for that 
characteristic. By appropriate choice of hyparparameters in 
the hyper prior the algorithm can be made to select subsets 
of varying sizes. Essentially, from all the data which is 
generated from the system, the method of the present 
invention enables identification of a small number of 
components which can be used to predict a particular 
characteristic. Once those components have been identified 
by this method, the components can be used in future to 
predict the characteristic for new samples. The method of 
the present invention preferably utilises a statistical 
method to eliminate components that are not required to 
correctly predict the characteristic for the sample. 
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The inventors have found that component weights of a linear 
combination of components of data generated from the 
training samples can be estimated in such a way as to 
eliminate the components that are not required to predict a 
characteristic for a training sample. The result is that a 
subset of components are identified which can correctly 
predict the characteristic for samples in the training set. 
The method of the present invention thus permits 
identification from a large amount of data a relatively 
small number of components which are capable of correctly 
predicting a characteristic for a training sample, for 
example, a quantity of interest. 

The characteristic may be any characteristic of interest. In 
one embodiment, the characteristic is a quantity or measure. 
In another embodiment, they may be the index number of a 
group, where the samples are grouped into two sample groups 
(or "classes") based on a pre -determined classification. The 
classification may be any desired classification by which 
the training samples are to be grouped. For example, the 
classification may be whether the training samples are from 
a leukemia cell or a healthy cell, or that the training 
samples are obtained from the blood of patients having or 
not having a certain condition, or that the training samples 
are from a cell from one of several types of cancer as 
compared to a normal cell. In another embodiment the 
characteristic may be a censored survival time, indicating 
that particular patients have survived for at least a given 
number of days. In other embodiments the quantity may be any 
continuously variable characteristic of the sample which is 
capable of measurement, for example blood pressure. 

In one embodiment, the data may be a quantity y ( . where 
ie{l t ...,N} . We write the nxl vector with elements y t as y. We 
define a p x 1 parameter vector p of component weights (many 
of which are expected to be zero) , and a q x 1 vector of 
parameters q> (not expected to be zero) . Note that q could be 
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zero (i.e. the set of parameters not expected to be zero may 
be empty) . 

In one embodiment, the input data is organised into an 
nxpdata matrix X = (x iJ ) with n test training samples and p 
components. Typically, p will be much greater than n . 

In another embodiment, data matrix X may be replaced by an n 
x n kernel matrix K to obtain smooth functions of X as 
predictors instead of linear predictors. An example of the 
kernel matrix K is k lj =exp(-0.5* (xt-xj) '(Xi-Xj) /a 2 ) where the 
subscript on x refers to a row number in the matrix X. 
Ideally, subsets of the columns of K are selected which give 
sparse representations of these smooth functions. 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the apriori 
assumption that most of the component weights are zero. 

In one embodiment, the prior specified for the component 
weights is of the form: 



wherein v is a p x 1 vector of hyperparameters, and where 



p{P)=\p(py) P (v*)dv> 

«2 



(CI) 



p(p\v 2 ) is tf(0,diag{v 2 }) 



and p{v 2 ) is some hyperpri- 



or 



distribution for v 2 . 



A suitable form of hyperprior is, 




In another embodiment, the hyperprior p[y 2 ) is such that 
each tj =l/j>. 2 has an independent gamma distribution. 



In another embodiment, the hyperprior p(v 2 ) is such that 
each vf has an independent gamma distribution. 
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Preferably, an uninformative prior for 9 is specified. 

The likelihood function is defined from a model for the 
distribution of the data. Preferably, in general, the 
likelihood function is any suitable likelihood function. For 
example, the likelihood function L(y\/3<p) may be, but not 
restricted to, of the form appropriate for a generalised 
linear model (GLM) , such as for example, that described by 
Nelder and Wedderburn (1972). Preferably, in this case, the 
likelihood function is of the form: 

L = log p(y 1 ft 0) m ± (MzM) + ^,0) } ( 
.=1 */W>) 

where y = (y x ,..., y n ) T and ai(0) = 0 /w ± with the w ± being a 
fixed set of known weights and 0 a single scale parameter. 

Preferably, the likelihood function is specified as follows: 
We have 

E{y i }=b'(8 i ) 

Var{y} = b'Wa^) = ^(0) (C3 ) 



Each observation has a set of covariates x A and a linear 
predictor rji. = Xi T fi. The relationship between the mean of 
the i th observation and its linear predictor is given by the 
link function Vl = g(n L ) = g( b ' (6 L ) ). The inverse of the 
link is denoted by h, i.e 
Mi = b' <0i) = h(i7i) . 

In addition to the scale parameter, a generalised linear 
model may be specified by four components: 

• the likelihood or (scaled) deviance function, 

• the link function 

• the derivative of the link function 

• the variance function. 
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Some common examples of generalised linear models are given 
in the table below. 



Distribution 


Link function 

g(/*> 


Derivative 
of link 
function 


Variance 
function 


Scale 
parame 
ter 


Gaussian 




1 


1 


Yes 


Binomial 


log(jt/(l- /*)) 


i/( Md- m>> 


Md- 
M)/n 


No 


Poisson 


log (/*) 




/* 


No 


Gamma 


1/ M 


-1/ M 2 


/* 2 


Yes 


Inverse 
Gaussian 


1/ M 2 


-2/ At 3 


M 3 


Yes 



5 In another embodiment, a quasi likelihood model is specified 
wherein only the link function and variance function are 
defined. In some instances, such specification results in 
the models in the table above. In other instances, no 
distribution is specified. 

10 

In one embodiment, the posterior distribution of 0 <p and v 
given y is estimated using: 

p(P<pv\y)a L(y\P<p)p(p\v)p{v) (C4) 

15 

wherein L(y\0<p) is the likelihood function. 

In one embodiment, v may be treated as a vector of missing 
data and an iterative procedure used to maximise equation 
20 (C4) to produce maximum a posteriori estimates of p. The 

prior of equation (CI) is such that the maximum a posteriori 
estimates will tend to be sparse i.e. if a large number of 
parameters are redundant, many components of P will be zero. 

25 As stated above, the component weights which maximise the 

posterior distribution may be determined using an iterative 
procedure. Preferable, the iterative procedure for 
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maximising the posterior distribution of the components and 
component weights is an EM algorithm comprising an E step 
and an M step, such as, for example, that described in 
Dempster et al, 1977. 

In conducting the EM algorithm, the E step preferably 
comprises the step of computing terms of the form 

i-l 

p (C4a) 

where d, = <*,(#) = £{l/v, 2 1 fay** and for convenience we define 
d, = Vd, = 0ifft = 0. In the following we write d = d(j})=(d i ,d„...J n ) T . 
In a similar manner we define for example d(0 (a) ) and 
% W X^r W ) where /?<">=/>„/"> and P„ is obtained from the p 
by p identity matrix by omitting columns j for which >9j n) = 0 . 

Preferably, equation (C4a) is computed by calculating the 
conditional expected value of t*=\/v* when p[pi\vf) is N(o,v?) 

3X1(1 p{yf) has a specified prior distribution. Specific 
examples and formulae will be given later. 

In a general embodiment, appropriate for any suitable 
likelihood function, the EM algorithm comprises the steps: 

(a) Selecting a hyperprior and values for its 
parameters. Initialising the algorithm by 
setting n=0, S 0 =» {1,2,..., p } , initialise q> (0) 
, fJ* and applying a value for e, such as for 
example e = 10 " s ; 

(b) Defining 

{ i6S » (C5) 

0, otherwise 

and let Pn be a matrix of zeroes and ones such 
that the nonzero elements y (n) of P (n> satisfy 
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Y=P n T jS , 0=P n7 



(c) performing an estimation (E) step by calculating 

the conditional expected value of the posterior 
distribution of component weights using the 
function: 

Q(P | 3 (n >, <p< D >) = E{ logp(P, <p , v | y) | y, /?<">, 0 (n) } 

= Uy | ft 0< n) )-O.5 >3 r A(d(/3™)y 2 p (C6> 



where L is the log likelihood function of y. 
Using 0=P D 7and d(V n) ) as defined in (C4a) , (C6) 
can be written as 



Q(7lV n) ,0 (n) )= Uy|P n 7,</> (n) >0.5r r A(rf(/">))-V (C7) 

(d) performing a maximisation (M) step by applying 

an iterative procedure to maximise Q as a 
function of y whereby y 0 = y <n) and for r=0,l,2, 
Yr*i = Yr + ct r 5r and where ar is chosen by a 
line search algorithm to ensure 

Q(7rM I V^, 4> (D) ) > Q(7r I ^ (D) ) f ^ 



5 r =A(d( y <">))[-A(d( Y <°>))|^A(d^ (C8) 
where : 

d(V n) ) = P n V(/»y n) )as in (C4a); and 

dL =J> rdL d 2 L T d 2 L 
dy t " 80 r ' a 2 7r " d% 0 

for j8 r =P nYr . 
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Let y* be the value of yr when some convergence 
criterion is satisfied/ for example, | | yr - 
yr+l|| < 8 (for example 1(T 5 ); 

Defining (J* = P n y* , S n+l ={i: | ft | >max(|0. ) } 

where s x is a small constant, for example le-5. 

Set n=n+l and choose <p (n+1) = <p (n) +K n ( <p* - <p (n> ) 

3 

where <p* satisfies — L(y | P n V 9 <j>) = 0 and K n is a 
damping factor such that 0< K n < 1; and 

Check convergence. If | | y* - y (n > | | < e 2 where c 2 
is suitably small then stop, else go to step (b) 
above. 

In another embodiment, t? =l/v? has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of t? is 

It can be shown that 

E{t 2 |0}= (2k+l)/(2/b + /3 2 ) 
as follows: 
Define 

I(p,b,k)= J(t 2 ) p t exp(-0.5j3 2 t 2 )<y(t 2 ,b,k)dt 2 
o 

then 



(e) 



(f) 



(g) 



(h) 



ICp.b^b^^rCp+k+O.S^^Xl+O.Sb/S 2 )-^ 11 * 05 ) 
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Proof 

Let s = 0 2 /2 then 

I(p,b4c)=b p405 J(t 2 /b)^exp(-st 2 h<t 2 ,b,k)dt 2 
o 

Now using the substitution u = t 2 /b we get 

I(p,b,k)=b^ 5 /(u)^ 5 exp(-subh(u,iac)du 

0 

Now let s'=bs and substitute the expression for 7(u,l,k) . This 
gives 

OO 

I(p,b,k)=b« H<)S j expHs'+Du^^'du / T(k) 

0 

Looking up a table of Laplace transforms, eg Abramowitz and 
Stegun, then gives the result. 

The conditional expectation follows from 

E{t 2 |0}=I(l,b,k)/I(O,b,k) 
= (2k+l)/(2/b + 0 2 ) 

As k tends to zero and b tends to infinity we get the 
equivalent result using Jeffreys prior. For example, for 
k=0.005 and b=2*10 5 

E{t 2 |/3} = (l.Ol)/(lO- 5 +0 2 ) 

Hence we can get arbitrarily close to the algorithm with a 
Jeffreys hyperprior with this proper prior. 

In another embodiment, v] has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0. It can be shown that 
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CO 

Ju k ' 3/2 -'exp(-(^/u+u))du 

B{'i""lA>— * 

b/u kl/2 - , exp(-(^/ u +u))du 

0 

= /2 1 K 3/2 , k (2^) 

"VblftlK^C^) (C9) 
1 (27^)K 3/2 , k (2 A A') 
l^il 2 K.^(2>/^) 

where ^=0. 2 /2b and K denotes a modified Bessel function, 
which can be shown as follows: 
For k=l in equation (c9) 

E{f i - 2 |/3 i >=^b(l/|/3 i |) 
For K=0.5 in equation (C9) 

E{v: 2 |ft>- >^b(l/|AI){K 1 (2 > /^)/K 0 (2^)} 
or equivalently 

E{«T 2 lft} = (W ){2 > A"K 1 (2 > /^)/K 0 (2^')} 

Proof 

From the definition of the conditional expectation, writing 
4=& 2 /2b, we get 

00 

/»'," 2 i'i* , expKi' i " 2 )b- , (i' i - 2 /b) k - , exp(i' i - 2 /b)di' j 2 

E{»T 2 IA)=^ ■ 

jT , exp(-^i/ i - 2 )b- , (i' i - 2 /b) k - , exp(y i - 2 /b)df i 2 

0 

Rearranging, simplifying and making the substitution u=v?/b 
gives A.l 

The integrals in A.l can be evaluated by using the result 
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where K denotes a modified Bessel function, see 
Watson (1966) . 



Examples of members of this class are k=l in which case 

E{ V : 2 \0.}= ^(l/lftl) 

which corresponds to the prior used in the Lasso technique, 
Tibshirani(1996) . See also Pigueiredo (2001) . 

The case k=0.5 gives 

EK 2 \0.}= ^b(l/|/3 i |){K 1 (2 > /^)/K 0 (2 > /^)} 
or eguivalently 

E{f f - 2 lft>- 0 W ){2V^K 1 (2 > /^)/K 0 (2 > /^)} 

where K 0 and Ki are modified Bessel functions, see Abramowitz 
and Stegun(1970). Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
Stegun(1970, p379) . Details of the above calculations are 
given in the Appendix. 

The expressions above demonstrate the connection with the 
Lasso model and the Jeffreys prior model. 

It will be appreciated by those skilled in the art that as k 
tends to zero and b tends to infinity the prior tends to a 
Jeffreys improper prior. 

In one embodiment, the priors with 0< k £1 and b>0 form a 
class of priors which might be interpreted as penalising non 
zero coefficients in a manner which is between the Lasso 
prior and the original specification using Jeffreys prior. 
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In another embodiment, in the case of generalised linear 
models, step (d) in the maximisation step may be estimated 

by replacing -5— with its expectation E{-^}. This is 
0 Tr d y r 

preferred when the model of the data is a generalised linear 
model . 



For generalised linear models the expected value e{ ^ L } may 



be calculated as follows. Beginning with 



*l =^ {A( i_^ L)( y ili t L)} 



dp 



(CIO) 



where X is the N by p matrix with i th row x ± T and 



(Cll) 



we get 



Equations (CIO) and (Cll) can be written 



par 

E{^}=-XVX 
where V=A( ai (<p>i 2 (f^) 2 ) 



(C12) 
(C13) 



WO 2004/104856 



PCT/AU2004/000696 



- 59 - 

Preferably , for generalised linear models, the EM algorithm 
comprises the steps: 

(a) Choose a hyper prior and its parameters. 
Initialising the algorithm by setting n=0, S 0 = 
{1,2,..., p } , 9 (0) , applying a value for e, such as 
for example e = 10" 5 , and 

If p £ N compute initial values 0* by 

^(X'X+Xiy'X^yfft (C14 ) 

and if p > N compute initial values p* by 

/f =l(i -x T (xx T +xo-'x)x T g(yfO (ci5> 

where the ridge parameter X satisfies 0 < X £ 1 and £ 
is small and chosen so that the link function is well 
defined at y+£ . 

(b) Defining 



0, otherwise 



and let Pn be a matrix of zeroes and ones such 
that the nonzero elements y(n) of P (n) satisfy 

7=P n T 0 , /3=P n7 



(c) performing an estimation (E) step by calculating 
the conditional expected value of the posterior 
distribution of component weights using the 
function: 



Q(P | 0<">, q><»>) = E { logp(p, q> , v | y) | y, /8< n >, 

= L(y | ft </><">) -0.5 yff r h(d{p™)Y 2 p 



,<»>> « - «r . , ,,^, v _, „ (CI 6) 
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where L is the log likelihood function of y. 
Using 0 =P n Y and j8< n) =P n V n) (C16) can be written 
as 

5 Q( 7 1 V n >, = Uy I P n 7, <A (n) M>.5 y f 'A(d(r w yy*r (C17 ) 

(d) performing a maximisation (M) step by applying an 
iterative procedure, for example a Newton Raphson 
iteration, to maximise Q as a function of y whereby 

10 Yo - Y (n) and for r=0,l,2,... y r+1 = y r + otr 8 r where a r 

is chosen by a line search algorithm to ensure 
0(7*. I <T) > Q(7 r I V\ 0 (n) ) , ^ f or 

p ^ N use 

15 8 r = A(d(y ( " ) ))[Y D T V;'Y n +I]- 1 (Y n T V; , z r -^^) (C18) 

where 

Y n =A(d( Y < 0 >))P n T X 
20 v=A(a i (q>)r l 2 (|^) 2 ) 

and the subscript r denotes that these quantities are 
evaluated at \i = h( XP n y r ) . 
For p > N use 

25 

8=A(d(y<»>))P -Y n T (Y n Y n T + V r ) , Y n ](Y n T V; 1 z r -^) (C19) 
with V r and z r defined as before. 

(e) Let y* be the value of y r when some convergence 
30 criterion is satisfied e.g 

I I Yr - Yfil | < e (for example 10" 5 ) . 
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(f) Define p* = P n y* , SnH ={i: | ft | >max(|/3. |* € , ) } wher e 

ei is a small constant, say le-5. Set n=n+l and 
choose <p n+1 = <p n + K„( 9* - 9°) where <b* satisfies 
^L(y I P„7 ,0) = 0 and k„ is a damping factor such that 

0< Kn £ 1. Note that in some cases the scale parameter 
is known or this equation can be solve explicitly to 
get an updating equation for 9. 

The above embodiments may be extended to incorporate quasi 
likelihood methods Wedderburn (1974) and McCullagh and 
Nelder (1983)). In such an embodiment, the same iterative 
procedure as detailed above will be appropriate, but with L 
replaced by a quasi likelihood as shown above and, for 
example. Table 8.1 in McCullagh and Nelder (1983). In one 
embodiment there is a modified updating method for the scale 
parameter 9. To define these models requires specification 
of the variance function x 2 , the link function g and the 

derivative of the link function ^ . Once these are defined 

dp 

the above algorithm can be applied. 

In one embodiment for quasi likelihood models, step 5 of the 
above algorithm is modified so that the scale parameter is 
updated by calculating 



where u and T are evaluated at P* = P n y* . Preferably, this 
updating is performed when the number of parameters s in the 
model is less than N. A divisor of N -s can be used when s 
is much less than N. 
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In another embodiment, for both generalised linear models 
and Quasi likelihood models the covariate matrix X with rows 
Xi T can be replaced by a matrix K with ij th element k ±j and 
k ±j a K(xi-Xj) for some kernel function K • This matrix can 
5 also be augmented with a vector of ones. Some example 
kernels are given in Table 2 below, see Evgeniou et 
al(1999) . 



^ernei cunccion 


Formula for k( x - y ) 


Gaussian radial basis 
function 


exp( - | | x - y | | 2 / a) , 
a>0 


TM*VAW*fl A m n Aval ■ l ■ r iT - 1 _ 

xnverse muitiquaaric 


/II 1 f 2 2 % -1/2 

1 1 1 x - y II + c 2 ) 1/2 


Multiquadric 


(| | x - y | | 2 + c 2 ) * 


Thin plate splines 


| | x - y | | 2n+1 

|| x - y || 2n ln(|| x - y ||) 


Multi layer perceptron 


tanh( x-y-0 ) , for suitable 

e 


Ploynomial of degree d 


(1 + x'y ) d 


B splines 


B 2n *i(x - y) 


Trigonometric polynomials 


sin(( d +1/2 ) (x-y) )/sin( (x- 

y)/2) 



10 Table 2: Examples of kernel functions 

In Table 2 the last two kernels are one dimensional i.e. for 
the case when X has only one column. Multivariate versions 
can be derived from products of these kernel functions. The 

15 definition of B 2n +i can be found in De Boor (1978 ) . Use of a 
kernel function in either a generalised linear model or a 
quasi likelihood model results in mean values which are 
smooth (as opposed to transforms of linear) functions of the 
covariates X. Such models may give a substantially better 

20 fit to the data. 
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A fourth embodiment relating to a proportional hazards model 
will now be described. 

5 D. Proportional Hazard Models 

The method of this embodiment may utilise training samples 
in order to identify a subset of components which are 
capable of affecting the probability that a defined event 

10 (eg death, recovery) will occur within a certain t im e 

period. Training samples are obtained from a system and the 
time measured from when the training sample is obtained to 
when the event has occurred. Using a statistical method to 
associate the time to the event with the data obtained from 

15 a plurality of training samples, a subset of components may 
be identified that are capable of predicting the 
distribution of the time to the event. Subsequently, 
knowledge of the subset of components can be used for tests, 
for example clinical tests to predict for example, 

2 0 statistical features of the time to death or time to relapse 
of a disease. For example, the data from a subset of 
components of a system may be obtained from a DNA 
microarray. This data may be used to predict a clinically 
relevant event such as, for example, expected or median 

25 patient survival times, or to predict onset of certain 
symptoms, or relapse of a disease. 

In this way, the present invention identifies preferably a 
relatively small number of components which can be used to 

30 predict the distribution of the time to an event of a 

system. The selected components are "predictive" for that 
time to an event. Essentially, from all the data which is 
generated from the system, the method of the present 
invention enables identification of a small number of 

35 components which can be used to predict time to an event. 
Once those components have been identified by this method, 
the components can be used in future to predict statistical 
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features of the time to an event of a system from new 
samples. The method of the present invention preferably 
utilises a statistical method to eliminate components that 
are not required to correctly predict the time to an event 
5 of a system. By appropriate selection of the hyperparameters 
in the model some control over the size of the selected 
subset can be achieved. 

As used herein, "time to an event" refers to a measure of 
10 the time from obtaining the sample to which the method of 

the invention is applied to the time of an event. An event 
may be any observable event. When the system is a 
biological system, the event may be, for example, time till 
failure of a system, time till death, onset of a particular 
15 symptom or symptoms, onset or relapse of a condition or 
disease, change in phenotype or genotype, change in 
biochemistry, change in morphology of an organism or tissue, 
change in behaviour. 

20 The samples are associated with a particular time to an 
event from previous times to an event. The times to an 
event may be times determined from data obtained from, for 
example, patients in which the time from sampling to death 
is known, or in other words, "genuine" survival times, and 

25 patients in which the only information is that the patients 
were alive when samples were last obtained, or in other 
words, "censored" survival times indicating that the 
particular patient has survived for at least a given number 
of days. 

30 

In one embodiment, the input data is organised into am 
nxp data matrix A r = (^J with n test training samples and p 
components. Typically, p will be much greater than n . 

35 For example, consider an Nxp data matrix X^fo) from, for 
example, a microarray experiment, with N individuals (or 
samples) and the same p genes for each individual. 
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Preferably, there is associated with each individual 
i (i = 1,2.-, N) a variable^, (>/, £0) denoting the time to an 
event, for example, survival time. For each individual 
there may also be defined a variable that indicates whether 
that individual's survival time is a genuine survival time 
or a censored survival time. Denote the censor indicators 
as Cj where 



is uncensored 
is censored 



The Nxl vector with survival times y t may be written as y 
and the Nxl vector with censor indicators c, as c. 

Typically, as discussed above, the component weights are 
estimated in a manner which takes into account the a priori 
assumption that most of the component weights are zero. 

Preferably, the prior specified for the component weights is 
of the form 



*(A.A.-.A)- JTMAK )P(r, )dr (Dl) 

r M 



where are component weights, P(0i\*i) is^(0,r/Jand 

p [ T i ) is some hyperprior distribution 

that is not a Jeffrey's hyperprior. 

In one embodiment, the prior distribution is an inverse 
gamma prior for T in which tf =l/r* has an independent gamma 
distribution with scale parameter b>0 and shape parameter 
k>0 so that the density of t? is 
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7(tf,b,k)=b- , (t, 2 /b) k - 1 exp(-tf/b)/r(k) . 
It can be shown that: 
5 E{t 2 |0}= (2k+l)/(2/b + 0 2 ) ( A ) 

Equation A can be shown as follows: 
Define 

CO 

10 I(p,b,k)= J(t 2 ) p t exp(-0.5/S 2 t 2 h(t 2 ,b,k)dt 2 

0 

then 

I(p,b,k)=b ,H<)5 {r(p+k+0.5)/r(k)}(l-K).5b/3 2 )- (p+k+O5 > 



15 Proof 

Let s = p 2 /2 then 



IfeWO-V* 3 J(t 2 ^) I ^ 5 exp(-st 2 )7(t 2 ,b,k)dt 2 

0 

Now using the substitution u = t 2 /b we get 

I(p,b,k)=b^ 5 /(u)^ 5 exp(-sub>y(u,l,k)du 



20 



30 



Now let s'=bs and substitute the expression for -y(u,l,k) . This 
gives 



Ifobjc^b 1 * 05 J expHs'+l^u^ ^du / T(k) 



25 Looking up a table of Laplace transforms/ eg Abraxnowitz and 
Stegun, then gives the result. 

The conditional expectation follows from 

E{t 2 |^}=I(l,b,k)/I(0,b,k) 



= (2k+l)/(2/b+ j 8 2 ) 
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As k tends to zero and b tends to infinity, a result 
equivalent to using Jeffreys prior is obtained. For example, 
for k=0.005 and b=2*10 5 

5 

E{t 2 |0} = (1.01)/(10 5 + /3 2 ) 

Hence we can get arbitrarily close to a Jeffery's prior with 
this proper prior. 

10 The modified algorithm for this model has 

b (n) =E{t 2 | 

where the expectation is calculated as above. 

15 

In yet another embodiment, the prior distribution is 
distribution for rf g . Preferably, the gamma distribution has 
scale parameter b>0 and shape parameter k>0. 

2 0 It can be shown, that 

CO 

Ju k - 3/2, exp(-(7 i /u+u))du 

E{r*|ft>--t • 

b Ju k,/2, exp(-(7 i /u+u))du 

0 

Vbi/jjK^eVrT) 

_ 1 (2V7~)K 3/2 , t (2V70 
I ft I' K 1/2 . k (2V7~) 

where y^fif/lb and K denotes a modified Bessel function. 
25 Some special members of this class are k=l in which case 



E{r/ 2 |/3 i }=>/27b(l/|/3 i |) 
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which corresponds to the prior used in the Lasso technique , 
Tibshirani(1996) . See also Pigueiredo (2001) . 

The case k=0.5 gives 



where K 0 and Ki are modified Bessel functions, see Abramowitz 
and Stegun(1970) . Polynomial approximations for evaluating 
these Bessel functions can be found in Abramowitz and 
Stegun(1970, p379) • 

The expressions above demonstrate the connection with the 
Lasso model and the Jeffreys prior model. 

Details of the above calculations are as follows: 

For the gamma prior above and y^fif/lb 



E{t: 2 |$}= >/27b(l/|/3 i |){K,(2 > /^)/K 0 (2 > /^)} 



or equivalently 



E« 2 |ft> - (1/lftl 2 ){2 A /^K,(2V^)/K 0 (2 > /^)} 



E{r i - 2 |A>= 



oo 

J u k-3*-l exp( _ (7j/u+u))du 



0 



oo 



b Ju^'expC-frj/u +u) )du 



0 




(D2) 



where K denotes a modified Bessel function. 
For k=l in (D2) 



E{r i " 2 | j S j }=>/27b(l/| J 8 j |) 
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Por K=0.5 in (D2) 

E{r/ 2 10.}= V2^b(l/|^|){K 1 (2 A /^)/K 0 (2V^)} 
or equivalently 

E{rr 2 |ft} = (1/lftl 2 ){2V^K,(2V^)/K 0 (2V^)} 

Proof 

From the definition of the conditional expectation, writing 
y { =P?l2X>. we get 

00 

Jr i - 2 r i - , ex P (- r ,.r i - 2 )b- , (r i - 2 /b) k - , exp(r i - 2 /b)dr i 2 
E{r 5 - 2 1^}=^ 

0 

Rearranging, simplifying and making the substitution u=y?/b 
gives A.l 

The integrals in A.l can be evaluated by using the result 

>^exp[-^ + ^.jJfe=^(2a) 

where K denotes a modified Bessel function, see 
Watson (1966) . 

In a particularly preferred embodiment, p^al/rf is a 
Jeffreys prior, Kotz and Johnson (1983) . 

The likelihood function defines a model which fits the data 
based on the distribution of the data. Preferably, the 
likelihood function is of the form: 

N 

Log (Partial) Likelihood 

»=1 
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where fi T =(fl,£ 2 . ...,fi p ) and <pj =(^, % ,-,<P 9 )are the model 
parameters. The model defined by the likelihood function 
may be any model for predicting the time to an event of a 
system. 



In one embodiment, the model defined by the likelihood is 
Cox's proportional hazards model. Cox's proportional hazards 
model was introduced by Cox (1972) and may preferably be 
used as a regression model for survival data. In Cox's 
proportional hazards model, p 1 is a vector of (explanatory) 
parameters associated with the components. Preferably, the 
method of the present invention provides for the 
parsimonious selection (and estimation) from the parameters 
P 1 =(P\-P2.—.P P ) for Cox's proportional hazards model given 
the data X , y and c . 

Application of Cox's proportional hazards model can be 
problematic in the circumstance where different data is 
obtained from a system for the same survival times, or in 
other words, tied survival times. Tied survival times may 
be subjected to a pre-processing step that leads to unique 
survival times. The pre-processing proposed simplifies the 
ensuing code as it avoids concerns about tied survival times 
in the subsequent application of Cox's proportional hazards 
model . 



The pre-processing of the survival times applies by adding 
an extremely small amount of insignificant random noise. 
Preferably, the procedure is to take sets of tied times and 
add to each tied time within a set of tied times a random 
amount that is drawn from a normal distribution that has 
zero mean and variance proportional to the smallest non-zero 
distance between sorted survival times. Such pre-processing 
achieves an elimination of tied times without imposing a 
draconian perturbation of the survival times. 
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The pre-processing generates distinct survival times. 
Preferably, these times may be ordered in increasing 

magnitude denoted as t = ('(iyt {2 y -t {N) ). > . 

Denote by Z the Nxp matrix that is the re -arrangement 
of the rows of X where the ordering of the rows of 
Z corresponds to the ordering induced by the ordering of t; 
also denote by Z, the j th row of the matrix Z . Let d be the 
result of ordering c with the same permutation required to 
order t_ . 

After pre-processing for tied survival times is taken 
into account and reference is made to standard texts on 
survival data analysis (eg Cox and Oakes, 1984), the 
likelihood function for the proportional hazards model may 
preferably be written as 



TV 



'(<i>?)=n 



(D3) 



where p T =(0 l ,0 2 , - ,P n ) . Zj = the j th row of Z, and 

{i:i = JJ + l.—.N}~ the risk set at the j th ordered event 
time tn\ . 

The logarithm of the likelihood (ie L = log(l)) may 
preferably be written as 

' f w 



i=7 



ZiP-log 



2 exp(Zjp) 



JJ 



N 
1=7 



Zi/3-log 



N 



where 



y £Cijexp(z j p) 
U-i " J J 



(D4) 



. _ j0. if J <i 
iJ ~\l, i£J*i 
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Notice that the model is non-parametric in that the 
parametric form of the survival distribution is not 
specified - preferably only the ordinal property of the 
5 survival times are used (in the determination of the risk 
sets) . As this is a non-parametric case <p is not required 
(ie q=0) . 

In another embodiment of the method of the invention, the 
10 model defined by the likelihood function is a Parametric 
survival model. Preferably, in a parametric survival model, 
ft 1 is a vector of (explanatory) parameters associated with 
the components, and <p l is a vector of parameters associated 
with the functional form of the survival density function. 

15 

Preferably, the method of the invention provides 
for the parsimonious selection (and estimation) from the 
parameters fi l and the estimation of <p A =[<Pi,q>2,~',p q ) for 
parametric survival models given the data X , y and c . 

20 

In applying a parametric survival model, the survival times 
do not require pre-processing and are denoted as y . 
The parametic survival model is applied as follows: 
Denote by f{y;0,f},x} the parametric density function of the 

25 survival time, denote its survival function by 

00 

S (^ ; ?'^'^ r ) = \fi^ ;( E'^' X )^ u where <P a ^e the parameters relevant 

y 

to the parametric form of the density function and fi,X are 
as defined above. The hazard function is defined as 

30 

Preferably, the generic formulation of the log- likelihood 
function, taking censored data into account, is 
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N ( 

Reference to standard texts on analysis of survival time 
data via parametric regression survival models reveals a 
collection of survival time distributions that may be used. 
Survival distributions that may be used include , for 
example, the Weibull, Exponential or Extreme Value 
distributions . 

If the hazard function may be written as 

h(y i ;<p^ 9 X) = X(y i ;(p)exp(x i ^ then §(yi;<p,/3,x) = exp[-A(y i; <p^^ and 
f{yil<P.p,Xyx{y^ where A( yi ;<p) = ^(u;f)/u 

_ /x dA\yi;<p\ 

is the integrated hazard function and Myi>'<p)= v ; ; X± is 

V ~ } dy { 

the i th row of X. 



The Weibull, Exponential and Extreme Value distributions 
have density and hazard f vine t ions that may be written in the 
form of those presented in the paragraph immediately 
above. 



The application detailed relies in part on an algorithm of 
Altken and Clayton (1980) however it permits the user to 
specify any parametric underlying hazard function. 



Following from Aitkin and Clayton (1980) a preferred 
likelihood function which models a parametic survival model 



is: 



N 
i=l 



c ilog{ni)-Hi+Ci 



(D5) 



where ft ■ = A^y^pjexp^Xj/}) . Aitkin and Clayton (1980) note 
that a consequence of equation (11) is that the c, ' s may be 
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treatedas Poisson variates with means and that the last 
term in equation (11) does not depend on p (although it 
depends on <p ) . 

Preferably, the posterior distribution of fi , (p and T given 
y is 

p \P>v> x ~\ y - c ~) a l \y\P^) p {P\i) p Ki) (ds) 

wherein l\y\P,<P,c) is the likelihood function. 

In one embodiment, r may be treated as a vector of missing 
data and an iterative procedure used to maximise equation 
(D6) to produce a posteriori estimates of p . The prior of 
equation (Dl) is such that the maximum a posteriori 
estimates will tend to be sparse i.e. if a large number of 
parameters are redundant, many components of /? will be 
zero . 



Because a prior expectation exists that many components of 
P l are zero, the estimation may be performed in such a way 
that most of the estimated fi t ' s are zero and the remaining 
non-zero estimates provide an adequate explanation of the 
survival times . 

In the context of microarray data this exercise translates 
to identifying a parsimonious set of genes that provide an 
adequate explanation for the event times. 

As stated above, the component weights which maximise the 
posterior distribution may be determined using an iterative 
procedure. Preferable, the iterative procedure for 
maximising the posterior distribution of the components and 
component weights is an EM algorithm, such as, for example, 
that described in Dempster et al, 1977. 
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If the E step of the EM algorithm is examined, from (D6) 
ignoring terms not involving beta, it is necessary to 
compute 

i=l 

A - (D7) 

i=l 

where d ( =d l (P l ) = EQ/vf \ #}-° 5 and for convenience we define 
d, = \l d f = 0 if p. = 0. In the following we write d = d(0)=(d l ,d 2 ,...,d n ) T . 
In a similar manner we define for example d(fi w ) and 
d(r (a) )=P?d(P n r in) ) where = />„/»> and P„ is obtained from the p 
by p identity matrix by omitting columns j for which y3< n) =0. 

Hence to do the E step requires the calculation of the 
conditional expected value of t^l/T* when p[p,\zf) is w(o,T, 2 ) 

and />(r 2 )has a specified prior distribution as discussed 
above. 

In one embodiment, the EM algorithm comprises the steps: 

1. Choose the hyperprior and values for its parameters, 
namely b and k. Initialising the algorithm by setting n=0. 

So = (1,2,„, p } , initialise fit® = p* , <p(°) , 

2 . Defining 



0, otherwise 



and let P n be a matrix of zeroes and ones such that the 
nonzero elements y yi) of pV> satisfy 



r=PlP_ , js=P„r 



-r>Tp a n _ (D8) 
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3. Performing an estimation step by calculating the expected 
value of the posterior distribution of component weights. 
This may be performed using the function: 

where L is the log likelihood function of y . Using 
P = P„y and p™ = P nY ™ we have 

Q{y\r}"\<P {n) ) = L{tjp a y,<p^)- l ~y T ^d(y^)) r (D io) 

4. Performing the maximisation step. This may be 
performed using Newton Raphson iterations as follows: 

Set y 0 =/ r )and for r=0,l,2,.„ y r+l = y r + a r £. where a r is 

chosen by a line search algorithm to ensure Q^y r+1 | y W ,0> w ) > 
Q(r r I ? ( ' 

^=A( < /(y(-)))[-A( < /( r <->))^-AW>))+/]'(-^L &_-) 

" a £ " d lr d(r in) r 

Let y be the value of y r when some convergence 
criterion is satisfied e.g \\y r _y r+l \\ < e (for example * = 1(T 5 ) 



5. Define P*=P n y\ S„ = 



i:\Pi | >e x max\pj |i where £, is a small 



7 



cons 



tant, say 10" 5 . Set n=n+l, choose ^ n+1 ) =? (") + ^ 
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dL(y\P n y\<p) 

where tp satisfies ~ = 0 and K n is a damping factor 

og> 

such that 0<K n <1 . 

6. Check convergence. If ->A'V ||<^ 2 where s 2 is suitably 

small then stop, else go to step 2 above. 

In another embodiment, step (Dll) in the maximisation step 

may be estimated by replacing with its expectation 

d 7 r 

a 2 7 r 



10 In one embodiment, the EM algorithm is applied to maximise 
the posterior distribution when the model is Cox's 
proportional hazard's model. 

To aid in the exposition of the application of the EM 
15 algorithm when the model is Cox's proportional hazards 
model, it is preferred to define "dynamic weights" and 
matrices based on these weights. The weights are - 

w u=nt x —^—> 

1 ECijexp(Zjfi) 

7=1 
N 

20 

Matrices based on these weights are - 
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W = 



#2 



• 0 N 

0 ••• w* N j 



In terms of the matrices of weights the first and 
second derivatives of L may be written as 



dL t - 
~ = Z T W 
d0 



d 2 L 
dfi 2 



=z t (w"-a(w*))z = z t kz 



(D12) 



where ti=W -A\W j. Note therefore from the transformation 
matrix P n described as part of Step (2) of the EM algorithm 
(Equation D8) (see also Equations (Dll) ) it follows that 



dy r dp, " 

dr 2 r " dp, 2 " K * r 



-A(W'))ZP„ =P;Z T KZP n 



(D13) 
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Preferably, when the model is Cox's proportional hazards 
model the E step and M step of the EM algorithm are as 
follows : 

1. Choose the hyperprior and its parameters b and k. Set 
n=0, 3 0 = (1,2,'", p ), L et v be the vector with 
components 

' U , if c,=0 

for some small 6 , say .001. Define f to be log(v/t) 
If p £ N compute initial values 0" by 

/f=(z T z+uyz T f 

If p > N compute initial values /?*by 

/f =j(/ -Z T (ZZ T +Al) l Z)Z T f 
where the ridge parameter X satisfies 0 < X £ 1. 
2 . Define 

0, otherwise 

Let /> be a matrix of zeroes and ones such that the nonzero 
elements yWof /?(") satisfy 

r = /»//? , P =P n y 
3. Perform the E step by calculating 
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Q{p\pW) = *{tog(p(£. ? ,T|i))|£,£«} 

where Lis the log likelihood function of £ given by 
Equation (8) . Using = P n y and 0™ = P n /"> we have 

Q(r\/ n) ) = xa |/fc)-^AWr"»r 

4. Do the M step. This can be done with Newton Raphson 
iterations as follows. Set y 0 =j^ r >and for r=0,l,2,.„ 

Zr+\ ~ Yr + a r §r where a r is chosen by a line search 
algorithm to ensure Q(y r+l I /"W'J >Q[y T \ y yn> ,<p yn > ) • 
For p g N use 



S r = A[d(y} n ))}{Y T KY + l)~ l {Y T W- a[\/ d(y} n ) jjy). 
where Y = ZP n A [d(y} n ) ;J . 



For p > N use 



Let y* be the value of y r when some convergence 
criterion is satisfied e.g | | y r - y r+1 | J < c (for example 
10- 5 ) . 



5. Define /?* P n y , S„ | > £l max\/3j\^ where ^ is a 

small constant, say 10" 5 . This step eliminates variables 
with very small coefficients. 

6. Check convergence. if \\/-y^\\<s 2 where e 2 is suitably 
small then stop, else set n=n+l, go to step 2 above and 
repeat procedure until convergence occurs. 
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In another embodiment the EM algorithm is applied to 
maximise the posterior distribution when the model is a 
parametric survival model. 



In applying the EM algorithm to the parametic survival 
model, a consequence of equation (11) is that the Ci's may 
be treated as Poisson variates with means fi i± and that the 
last term in equation (11) does not depend on /? (although 
it depends on <p) . Note that log{/Ji)^log(A\ < y i ;ipjj + X i J3 and so 
it is possible to couch the problem in terms of a log- linear 
model for the Poisson- like mean. Preferably , an iterative 
maximization of the log- likelihood function is performed 
where given initial estimates of <p the estimates of p are 
obtained. Then given these estimates of fi , updated 
estimates of q> are obtained. The procedure is continued 
until convergence occurs. 

Applying the posterior distribution described above, we 
note that (for fixed <p) 

dp f* dp d/3 dfi and dp ~ Xi (D14> 

Consequently from Equations (11) and (12) it follows that 

The versions of Equation (12) relevant to the parametric 
survival models are 



0 



(D15) 



To solve for <p after each M step of the EM algorithm (see 
step 5 below) preferably put ^ n+1 ) = + *:„ -pMj where tp* 
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dL 

satisfies — = 0 for 0<tc n <\ and P is fixed at the value 

oq> 

obtained from the previous M step. 

It is possible to provide an EM algorithm for parameter 
selection in the context of parametric survival models and 
5 microarray data. Preferably, the EM algorithm is as 
follows : 

1. Choose a hyper prior and its parameters b and k eg b=le7 
and k=0.5. Set n=0, S 0 = fl,2,-.-, p} q>( initial ) = . Let v be 
10 the vector with components 



v = / 1_e ' *f c ' =1 
V i *e , if 



if c ( =0 



for some small 8 , say for example .001. Define f to be 
15 log(v/A( y/ <p)) . 

If p £ N compute initial values J3* by ft* = (X T X + XI) A X T f . 
If p > N compute initial values /?* by 

/T=j(7 -X T {XX T +XI) A X)X T f 

2 0 where the ridge parameter X satisfies 0 < k <> 1. 

2. Define 

_ f Pi y i e S n 
0, otherwise 

Let P n be a matrix of zeroes and ones such that the nonzero 
25 elements y^ot satisfy 



r in) - pit" ■ 



00 
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3 . Perform the E step by calculating 

GC?I£V">) = £{tog(*(A ? ,r|>.))| ? ,>8<V">} 

where Lis the log likelihood function of y and <p("K 

Using p = P n y and /?<"> = P n y<"> we have 
g(y|yW, ? <">) = Liy\P n r,9 (n) )-^y T ^{d(r w ))r 

4. Do the M step. This can be done with Newton Raphson 
iterations as follows. Set y 0 =y^ and. for r=0,l,2,... 
Yr+\ = Y r + «r €r where a r is chosen by a line search 
algorithm to ensure | r W .£ W ) >Q(r r | r W ,£ W ) . 

For p g N use 

S r = -A(d(y"))[Y;A(v)Y n +/T»C^(c- tf )- A(l/d(yM))y) 

where r = JKP a A(rf(£ W )). 
For p > N use 

&=-^(^W;)(/-r r (^^(y ? ))-V)( 1 f(c- i ,)- 4 (^(»)^] 

Let y* be the value of y r when some convergence 
criterion is satisfied e.g | | y r - y r+1 | | < e (for example 
10" 5 ) . 



5. Define 0*=P„/, S n = 



| >£ X max\fij\\ where e x is a small 



constant, saylO" 5 . Set n=n+l # choose <p( n+1 ) = ? ( n ) + Kfi [<p* -pC") j 
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. s . dL (y\Pn/ ,<p) 

where <p satisfies ^— : ^ = 0 and * n is a dating 
factor such that 0</c„ <1 . 

6. Check convergence. If ||< * 2 where s 2 is suitably 

small then stop, else go to step 2. 

In another embodiment, survival times are described by a 
Weibull survival density function. For the Weibull case <p 
is preferably one dimensional and 

A(y;<p) = y a , 

X(y;<p) = ay a -\ 
cp-a 

Preferably, jL = K + j?( c . - Mi )log( yi ) = Q is solved 

i'=l 

after each M step so as to provide an updated value of a . 

Following the steps applied for Cox's proportional hazards 
model, one may estimate a and select a parsimonious subset 
of parameters from p that can provide an adequate 
explanation for the survival times if the survival times 
follow a Weibull distribution. A numerical example is now 
given . 

The preferred embodiment of the present invention will now 
be described by way of reference only to the following non- 
limiting example. It should be understood, however, that 
the example is illustrative only, should not be taken in any 
way as a restriction on the generality of the invention 
described above. 



Example : 
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Full normal regression example 201 data points 41 basis 
functions 

k=0 and b=le7 

the correct four basis functions are identified namely 
2 12 24 34 

estimated variance is 0.67. 

With k=0 .2 and b=le7 
10 eight basis functions are identified, namely 
2 8 12 16 19 24 34 

estimated variance is 0.63. Note that the correct set of 
basis functions is included in this set. 

15 The results of the iterations for k=0.2 and b=le7 are given 
below. 

EM Iteration: 0 expected post: 2 basis fns 41 

20 sigma squared 0.6004567 

EM Iteration: 1 expected post: -63.91024 basis fns 41 

sigma squared 0.6037467 

EM Iteration: 2 expected post: -52.76575 basis fns 41 

25 

sigma squared 0.6081233 

EM Iteration: 3 expected post: -53.10084 basis fns 30 

sigma squared 0.6118665 
30 EM Iteration: 4 expected post: -53.55141 basis fns 22 

sigma squared 0.6143482 

EM Iteration: 5 expected post: -53.79887 basis fns 18 



35 sigma squared 0.6155 

EM Iteration: 6 expected post: -53.91096 basis fns 18 
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sigma squared 0, 6159484 

EM Iteration: 7 expected post: 



•53.94735 basis fns 16 



sigma squared 0.6160951 

EM Iteration: 8 expected post: 



-53.92469 basis fns 14 



sigma squared 0.615873 

EM Iteration: 9 expected posts 



-53.83668 basis fns 13 



sigma squared 0.6156233 

EM Iteration: 10 expected post; 



-53.71836 basis fns 13 



sigma squared 0.6156616 

Iteration: 11 expected post; 



•53.61035 basis fns 12 



sigma squared 0.6157966 

EM Iteration: 12 expected post: 



-53.52386 basis fns 12 



sigma squared 0.6159524 

EM Iteration: 13 expected post; 



-53.47354 basis fns 12 



sigma squared 0.6163736 

EM Iteration: 14 expected post: 



-53.47986 basis fns 12 



sigma squared 0.6171314 

EM Iteration: 15 expected post; 

sigma squared 0.6182353 

Iteration: 16 expected post: 



-53.53784 basis fns 11 



•53.63423 basis fns 11 



sigma squared 0.6196385 

EM Iteration: 17 expected post; 



-53.75112 basis fns 11 



sigma squared 0.621111 

EM Iteration: 18 expected post; 



-53.86309 basis fns 11 



sigma squared 0.6224584 
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EM Iteration: 19 expected post: -53.96314 basis fns 11 
sigma squared 0.6236203 

EM Iteration: 20 expected post: -54.05662 basis fns 11 
sigma squared 0.6245656 

EM Iteration: 21 expected post: -54.1382 basis fns 10 
sigma squared 0.6254182 

EM Iteration: 22 expected post: -54.21169 basis fns 10 
sigma squared 0.6259266 

EM Iteration: 23 expected post: -54.25395 basis fns 9 
sigma squared 0.6259266 

EM Iteration: 24 expected post: -54.26136 basis fns 9 
sigma squared 0.6260238 

EM Iteration: 25 expected post: -54.25962 basis fns 9 
sigma squared 0.6260203 

EM Iteration: 26 expected post: -54.25875 basis fns 8 
sigma squared 0.6260179 

EM Iteration: 27 expected post: -54.25836 basis fns 8 
sigma squared 0.626017 

EM Iteration: 28 expected post: -54.2582 basis fns 8 
sigma squared 0.6260166 

For the reduced data set with 201 observations and 10 
variables, k=0 and b=le7 

Gives the correct basis functions, namely 12 3 4. With 
k=0.25 and b=le7, 7 basis functions were chosen, namely 1 2 
3 4 6 8 9. A record of the iterations is given below. 
Note that this set also includes the correct set. 
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EM Iteration: 0 expected post: 2 basis fns 10 

sigma squared 0.6511711 
5 EM Iteration: 1 expected post: -66.18153 basis fns 10 

sigma squared 0.6516289 

EM Iteration: 2 expected post: -57.69118 basis fns 10 

10 sigma squared 0.6518373 

EM Iteration: 3 expected post: -57.72295 basis fns 9 

sigma squared 0.6518373 

EM Iteration: 4 expected post: -57.74616 basis fns 8 

15 

sigma squared 0.65188 

EM Iteration: 5 expected post: -57.75293 basis fns 7 
sigma squared 0.6518781 

20 Ordered categories examples 

Luo prostate data 15 samples 9605 genes. For k=0 and b=le7 
we get the following results 

misclassification table 
2 5 pred 

y 12 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 
30 4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 

35 For k=0.25 and b=le7 we get the following results 



misclassification table 
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pred 
y 1 2 3 4 

1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 7188 

Note that we now have perfect discrimination on the training 
data with the addition of one extra variable. A record of 
the iterations of the algorithm is given below. 

15 *********************************************** 
Iteration 1 : n cycles, criterion -4.661957 

misclassif ication matrix 
fhat 
20 f 12 

1 23 0 

2 0 22 

row =true class 



25 Class 1 Number of basis functions in model : 9608 

Iteration 2 s 5 cycles, criterion -9.536942 

misclassif ication matrix 
30 fhat 

f 12 

1 23 0 

2 1 21 

row =true class 

35 

Class 1 Number of basis functions in model : 6431 
**************************************** 
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Iteration 3 : 4 cycles, criterion -9.007843 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 Number of basis functions in model : 508 
************************************* ********** 

Iteration 4 : 5 cycles, criterion -6.47555 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row = true class 

Class 1 Number of basis functions in model x 62 
***************** ****************************** 

Iteration 5 : 6 cycles, criterion -4.126996 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 1 21 

row =true class 

Class 1 Number of basis functions in model : 20 
****************************** ***************** 

Iteration 6 : 6 cycles, criterion -3.047699 

misclassification matrix 
fhat 
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f 12 

1 23 0 

2 1 21 

row =true class 

Class 1 Number of basis functions in model : 12 
**************** ******************************* 

Iteration 7 : 5 cycles, criterion -2.610974 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
12 3 408 846 6614 7191 8077 
regression coefficients 

28.81413 14.27784 7.025863 -1 . 086501e-06 4.553004e-09 - 
16.25844 0.1412991 -0.04101412 

******************* **************************** 
Iteration 8 : 5 cycles, criterion -2.307441 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 1 21 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

32.66699 15.80614 7.86011 -18.53527 0.1808061 -0.006728619 
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************* ********************************** 

Iteration 9 : 5 cycles, criterion -2.028043 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 t Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

36.11990 17.21351 8.599812 -20.52450 0.2232955 -0.0001630341 

*********************************************** 
Iteration 10 : 6 cycles, criterion -1.808861 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 



Class 1 : Variables left in model 
12 3 6614 7191 8077 
regression coefficients 

39.29053 18.55341 9.292612 -22.33653 0.260273 -8 . 696388e-08 

*********************************************** 
Iteration 11 : 6 cycles, criterion -1.656129 



misclassif ication matrix 

fhat 
f 12 
1 23 0 



WO 2004/104856 



PCT/AU2004/000696 



- 93 - 

2 0 22 
row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

42.01569 19.73626 9.90312 -23.89147 0.2882204 

Iteration 12 : 6 cycles, criterion -1.554494 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

44.19405 20.69926 10.40117 -25.1328 0.3077712 
Iteration 13 : 6 cycles, criterion -1.487778 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

45.84032 21.43537 10.78268 -26.07003 0.3209974 
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Iteration 14 s 6 cycles, criterion -1.443949 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row strue class 

Class 1 t variables left in model 
12 3 6614 7191 
regression coefficients 

47.03702 21.97416 11.06231 -26.75088 0.3298526 

•************•**•***••**•*******„ 
Iteration 15 » 6 cycles, criterion -1.415 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

47.88472 22.35743 11.26136 -27.23297 0.3357765 

*****************************.********^^^ # 
Iteration 16 s 6 cycles, criterion -1.395770 

misclassif ication matrix 

fhat 
f 12 
1 23 0 
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2 0 22 
row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

48.47516 22.62508 11.40040 -27.56866 0.3397475 

********************* ************************** 
Iteration 17 : 5 cycles, criterion -1.382936 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

48.88196 22.80978 11.49636 -27.79991 0.3424153 

**************** ******************************* 
Iteration 18 : 5 cycles, criterion -1.374340 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.16029 22.93629 11.56209 -27.95811 0.3442109 
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*********************** ************************ 

Iteration 19 : 5 cycles, criterion -1.368567 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row strue class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.34987 23.02251 11.60689 -28.06586 0.3454208 

************************** ********************* 
Iteration 20 : 5 cycles, criterion -1.364684 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row = true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.47861 23.08109 11.63732 -28.13903 0.3462368 

*********************************************** 
Iteration 21 : 5 cycles, criterion -1.362068 

misclassif ication matrix 
fhat 
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1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.56588 23.12080 11.65796 -28.18862 0.3467873 

*********************************************** 
Iteration 22 : 5 cycles, criterion -1.360305 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.62496 23.14769 11.67193 -28.22219 0.3471588 

****************************************** ***** 
Iteration 23 : 4 cycles, criterion -1.359116 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 
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49.6649 23.16588 11.68137 -28.2449 0.3474096 

*********************************************** 
Iteration 24 : 4 cycles, criterion -1.358314 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row atrue class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.69192 23.17818 11.68776 -28.26025 0.3475789 

*********************************************** 
Iteration 25 ; 4 cycles, criterion -1.357772 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.71017 23.18649 11.69208 -28.27062 0.3476932 

*********************************************** 
Iteration 26 ; 4 cycles, criterion -1.357407 

misclassification matrix 
fhat 
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f 12 

1 23 0 

2 0 22 

row = true class 

5 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.72251 23.19211 11.695 -28.27763 0.3477704 

10 

************************ *********************** 
Iteration 27 : 4 cycles, criterion -1.35716 

misclassif ication matrix 
15 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

20 

Class 1 : Variables left in model 
12 3 6614 7191 
regression coefficients 

49.73084 23.19590 11.69697 -28.28237 0.3478225 

25 

************************************* ********** 
Iteration 28 : 3 cycles, criterion -1.356993 

misclassif ication matrix 
3 0 fhat 

f 1 2 

1 23 0 

2 0 22 

row = true class 

35 

Class 1 : Variables left in model 
12 3 6614 7191 
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regression coefficients 

49.73646 23.19846 11.6983 -28.28556 0.3478577 

**************************************** ******* 
5 Iteration 29 : 3 cycles, criterion -1.356881 

misclassif ication matrix 

fhat 
f 12 
10 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
15 12 3 6614 7191 

regression coefficients 

49.74026 23.20019 11.6992 -28.28772 0.3478814 

*********************************************** 
20 Iteration 30 : 3 cycles, criterion -1.356805 

misclassif ication matrix 

fhat 
f 12 
25 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
30 12 3 6614 7191 

regression coefficients 

49.74283 23.20136 11.69981 -28.28918 0.3478975 
1 

35 

misclassif ication table 
pred 
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y 12 3 4 

1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
6611 7188 

Ordered categories example 

Luo prostate data 15 samples 50 genes. For k=0 and b=le7 we 
get the following results 

misclassif ication table 

pred 
y 12 3 4 

1 4 0 0 0 

2 0 2 1 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
1 

Por k=0.25 and b=le7 we get the following results 

misclassif ication table 

pred 
y 12 3 4 

1 4 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
1 42 
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A record of the iterations for k=0.25 and b=le7 is given 
below 

*********************************************** 

5 Iteration 1 : 19 cycles, criterion -0.4708706 

misclassif ication matrix 

fhat 
f 12 
10 1 23 0 

2 0 22 
row =true class 

Class 1 Number of basis functions in model : 53 
15 *********************************************** 

Iteration 2 : 7 cycles, criterion -1.536822 

misclassif ication matrix 
fhat 
20 f 12 

1 23 0 

2 0 22 

row =tme class 

25 Class 1 Number of basis functions in model : 53 
*********************************************** 

Iteration 3 : 5 cycles, criterion -2.032919 

misclassif ication matrix 
30 fhat 

f 12 

1 23 0 

2 0 22 

row =true class 

35 

Class 1 Number of basis functions in model : 42 
*********************************************** 
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Iteration 4 : 5 cycles, criterion -2.132546 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row = true class 

Class 1 Number of basis functions in model : 18 
*************************** ******************** 

Iteration 5 : 5 cycles, criterion -1.978462 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 Number of basis functions in model : 13 
******************************* **************** 

Iteration 6 : 5 cycles, criterion -1.668212 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 10 41 43 45 
regression coefficients 

34.13253 22.30781 13.04342 -16.23506 0.003213167 0.006582334 
-0.0005943874 -3.557023 
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********************** *************** ****** #ik## 
Iteration 7 : 5 cycles, criterion -1.407871 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 10 41 43 45 
regression coefficients 

36.90726 24.69518 14.61792 -17.16723 1.112172e-05 5.949931e- 
06 -3.892181e-08 -4.2906 

*********************** ************************ 
Iteration 8 s 5 cycles, criterion -1.244166 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 10 45 
regression coefficients 

39.15038 26.51011 15.78594 -17.99800 1.125451e-10 -4.799167 

****************************************** w ** w# 
Iteration 9 : 5 cycles, criterion -1.147754 



misclassif ication matrix 
fhat 
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1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

40.72797 27.73318 16.56101 -18.61816 -5.115492 

****************** ***************************** 
Iteration 10 : 5 cycles, criterion -1.09211 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

41.74539 28.49967 17.04204 -19.03293 -5.302421 

************************* ********************** 
Iteration 11 : 5 cycles, criterion -1.060238 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 
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42.36866 28.96076 17.32967 -19.29261 -5.410496 

************************** ********************* 
Iteration 12 s 5 cycles, criterion -1.042037 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row = true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

42.73908 29.23176 17.49811 -19.44894 -5.472426 

************************** ********************* 
Iteration 13 : 5 cycles, criterion -1.031656 

misclassif ication matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 s Variables left in model 
1 2 3 4 45 

regression coefficients 

42.95536 29.38894 17.59560 -19.54090 -5.507787 

*********************** ************************ 
Iteration 14 : 4 cycles, criterion -1.025738 

misclassif ication matrix 
fhat 
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£ 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.08034 29.47941 17.65163 -19.59428 -5.527948 

****** ***************************************** 
Iteration 15 : 4 cycles, criterion -1.022366 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row = true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.15213 29.53125 17.68372 -19.62502 -5.539438 

********************************* ************** 
Iteration 16 : 4 cycles, criterion -1.020444 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row strue class 

Class 1 : Variables left in model 
1 2 3 4 45 



WO 2004/104856 



PCT/AU2004/000696 



- 108 - 

regression coefficients 

43.19322 29.56089 17.70206 -19.64265 -5.545984 

*********************************************** 
5 Iteration 17 : 4 cycles, criterion -1.019349 

misclassif ication matrix 

fhat 
f 12 
10 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
15 1 2 3 4 45 

regression coefficients 

43.21670 29.57780 17.71252 -19.65272 -5.549713 

*********************************************** 
20 Iteration 18 : 3 cycles, criterion -1.018725 

misclassif ication matrix 

fhat 
f 12 
25 1 23 0 

2 0 22 
row =true class 

Class 1 : Variables left in model 
30 1 2 3 4 45 

regression coefficients 

43.23008 29.58745 17.71848 -19.65847 -5.551837 

*********************************************** 
35 Iteration 19 : 3 cycles, criterion -1.01837 



misclassif ication matrix 
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fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.23772 29.59295 17.72188 -19.66176-5.553047 

*********************************************** 
Iteration 20 : 3 cycles, criterion -1.018167 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.24208 29.59608 17.72382 -19.66363 -5.553737 

**************** ******************************* 
Iteration 21 : 3 cycles, criterion -1.018052 

misclassification matrix 

fhat 
f 12 

1 23 0 

2 0 22 

row =true class 

Class 1 : Variables left in model 
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1 2 3 4 45 

regression coefficients 

43.24456 29.59787 17.72493 -19.66469 -5.55413 

5 *********************************************** 
Iteration 22 : 3 cycles, criterion -1.017986 

misclassif ication matrix 
fhat 
10 f 12 

1 23 0 

2 0 22 

row = true class 

15 Class 1 : Variables left in model 
1 2 3 4 45 

regression coefficients 

43.24598 29.59889 17.72556 -19.6653 -5.554354 
20 1 

misclassif ication table 

pred 
y 12 3 4 
25 14 0 0 0 

2 0 3 0 0 

3 0 0 4 0 

4 0 0 0 4 

Identifiers of variables left in ordered categories model 
30 1 42 



